The 4th and 5th question types of CSP certification are like climbing Mount Everest, which is desperate. The third question type is like climbing Mount Tai, knowing that it can be solved, but it will always take hours of hard work. At the request of the students, try to do the third question of the CSP certification in March 2023: LDAP. For the questions, please refer to the mock exam on the CSP official website.
The pattern of the third question type is: parse data, build model, perform operation/answer query/transform output. The computational portion of this LDAP problem is Responding to queries.
The input of other question types is mostly numbers, which is relatively simple. The input of the third question type is usually text, and it is necessary to use the lexical analysis and grammatical analysis of the compilation principle to analyze the data.
v1: Basic expression (20%)
For 20% of the input… are atomic expressions…
BASE_EXPR = ATTRIBUTE OPERATOR VALUE
Realize the simple situation first, hit a rabbit with a waistband:
- Data structure: two-level mapping
map
.> - Algorithm: Brute force search for all users that has the attribute and the value meets the requirement (equal, unequal).
- The user (DN), attribute number (aid), and attribute value (value) do not exceed 1e9, and do not involve arithmetic calculations, so it can be represented by
int
. - The basic expression can be read directly by using
cin >> aid >> op >> val
. Among them,char op;
may be:
or~
, which means equal or equal. // Error 1 - The output user list needs to be sorted from small to large, so use
set
to save. // Mistake 2
Received 20 minutes soon after submitting: running error 20 2.015s 58.58MB
. Runtime errors are also as expected, as assertions are set for unimplemented feature branches.
#include <bits/stdc++.h> using namespace std; using db_t = map<int, map<int, int> >; // u: user, a: attribute db_t db; // database, {uid: {attrib_id : value} } set<int> eval_base() {<!-- --> // base expression evaluation int aid, x; char op; cin >> aid >> op >> x; set<int> usr; for(auto & amp; m : db) {<!-- --> const int DN = m.first; for(auto & amp; a : m.second) {<!-- --> if(a.first == aid) {<!-- --> // target attribute exists if((op == ':' & amp; & amp; a.second == x) || (op == '~' & amp; & amp; a.second != x)) {<!-- --> usr.insert(DN); } break; // This user does not need to check other properties } } } return usr; } set<int> eval_expr() {<!-- --> // expression evaluation set<int> usr; char c; if(!(cin >> c)) {<!-- --> //EOF } else if(isdigit(c)) {<!-- --> // base expr cin.unget(); // fallback character c usr = eval_base(); } else {<!-- --> assert(false); // not implemented } return usr; } void print(const set<int> & amp; usr) {<!-- --> bool first = true; for(auto DN : usr) {<!-- --> if(first) {<!-- --> first = false; } else {<!-- --> cout << ' '; } cout << DN; } cout << '\ '; } int main(){<!-- --> int n, m; cin >> n; for(int DN, ak; n > 0 & amp; & amp; cin >> DN >> ak; --n) {<!-- --> map<int, int> & attribs = db[DN]; for(int aid, x; ak > 0 & amp; & amp; cin >> aid >> x; --ak) {<!-- --> attribs[aid] = x; } } cin >> m >> ws; for(int i = 0; i < m; + + i) {<!-- --> print(eval_expr()); } }
v2: Logical expressions (40%)
For 40% of the input… the expression contains a logical combination of at most two atomic expressions, which conforms to the BNF grammar EASY_EXPR.
- Ordinary expressions must start with numbers, while logical expressions start with
& amp;
or|
. - When parsing, read a character ahead to determine the expression type. If it is a number, it will fall back to
cin.unget()
, and then parse it according to a simple expression. - The evaluation result of the basic expression is
set
, intersection and union operations can be realized byset_intersection
andset_union
.
It took a few more minutes to write the submission: running error 40 2.078s 58.55MB
, still in line with expectations, because multi-level nested expressions have not been implemented yet. Everything is under control.
#include <bits/stdc++.h> using namespace std; using db_t = map<int, map<int, int> >; // u: user, a: attribute using us_t = set<int>; // user list db_t db; // database, {uid: {attrib_id : value} } us_t eval_base() {<!-- --> // base expression int aid, x; char op; cin >> aid >> op >> x; us_t us; for(auto & amp; m : db) {<!-- --> const int DN = m.first; for(auto & amp; a : m.second) {<!-- --> if(a.first == aid) {<!-- --> // target attribute exists if((op == ':' & amp; & amp; a.second == x) || (op == '~' & amp; & amp; a.second != x)) {<!-- --> us.insert(DN); } break; // This user does not need to check other properties } } } return us; } us_t eval_nested() {<!-- --> // bracket nested expression char c; cin >> c; assert(c == '('); us_t us = eval_base(); cin >> c; assert(c == ')'); return us; } us_t eval_expr() {<!-- --> // expression evaluation us_t us; char c; if(!(cin >> c)) {<!-- --> //EOF } else if(isdigit(c)) {<!-- --> // base expr cin.unget(); // fallback character c us = eval_base(); } else {<!-- --> const char logic = c; auto a = eval_nested(); auto b = eval_nested(); if(logic == ' & amp;') {<!-- --> set_intersection(a.begin(), a.end(), b.begin(), b.end(), inserter(us, us.begin())); } else {<!-- --> assert(logic == '|'); set_union (a.begin(), a.end(), b.begin(), b.end(), inserter(us, us.begin())); } } return us; } void print(const us_t & amp; us) {<!-- --> bool first = true; for(auto DN : us) {<!-- --> if(first) {<!-- --> first = false; } else {<!-- --> cout << ' '; } cout << DN; } cout << '\ '; } int main(){<!-- --> int n, m; cin >> n; for(int DN, ak; n > 0 & amp; & amp; cin >> DN >> ak; --n) {<!-- --> map<int, int> & attribs = db[DN]; for(int aid, x; ak > 0 & amp; & amp; cin >> aid >> x; --ak) {<!-- --> attribs[aid] = x; } } cin >> m >> ws; for(int i = 0; i < m; + + i) {<!-- --> print(eval_expr()); } }
v3: Nested expressions (40%)
At this time, I feel that victory is in sight, because on the basis of v2 only a little effort can support all grammars:
us_t eval_expr(); // forward declaration us_t eval_nested() {<!-- --> // bracket nested expression char c; cin >> c; assert(c == '('); us_t us = eval_expr(); // was eval_base(); cin >> c; assert(c == ')'); return us; }
Add forward declaration and recursively call eval_expr
to realize all grammatical functions. Commit with high hopes: run timed out 40 run timed out 58.55MB
v4: Futile optimization… (70%)
Time Limit Exceeded (TLE)… Looking back at the topic Time Limit: 12.0s
, I feel a little worried. The next several hours tried various options:
- Replace the attribute table
map
withunordered_map
- Slightly improved
- “The number of attributes for each user does not exceed 500, and the attribute number does not exceed 1e9”, discretization compression, mapping attribute IDs to continuous IDs?
- no change
- After the attribute ID is discretized, the attribute table can be replaced with a fixed-length vector
vector
, and the attribute ID is used as the index.(505) - Runtime error (RTE)… presumably array out of bounds. Could it be that I misunderstood the meaning of the question, there are 500 attributes for each person, but there are more different attributes for all users?
- Under the test code test, if the discretization attribute is more than 500, the wrong result will be output and the wrong answer (WA) will be triggered:
- Indeed changed from RTE to WA. I tried several times with 2000, etc., and the memory limit has been exceeded, and all attributes are far more than 500…
if(virtual_aid >= 505) {<!-- --> // The number of different attributes of each user is far more than 505, which indeed triggers an error cout << "OVERFLOW" << endl; return 0; }
v5: Data structure (100%, 7.31s)
I have used the data warehouse in the company before, and I have also been exposed to the research and development of the data engine Kudu. There are two types of memory databases:
- Online Transaction Processing (OLTP): pursue efficient addition, deletion, modification and query. Data is usually organized in rows (records).
- Online Analytical Processing (OLAP): The pursuit of efficient query performance. In addition to technologies such as partitioning and MPP, column storage is also commonly used.
The expression evaluation of this question is all data query without modification, similar to OLAP, which means that columnar storage should be used?
- Data structure:
map
, when reading in, user ID and attribute value are collected and saved by attribute.> - Algorithm: According to the attribute ID, all users with the attribute are directly obtained, and then filtered by the attribute value, without scanning all users.
Modify and submit:correct 100 7.312s 31.54MB
, passed! This is the most critical step of optimization.
v6: hash table (100%, 7.29s)
In the previous row-based data organization, using the hash table to save the attribute memory exceeded the limit. Is it faster to replace map
with unordered_map
after switching to columnar storage?
- Modify and submit:
Correct 100 7.296s 29.55MB
The optimization effect is weak.
v7: presort + vector (100%, 2.25s!)
Also found optimization points:
- Both intersection and union keep the results in order. If the results of the underlying expressions are ordered, then the final results are also ordered.
- So there is no need to use
set
. The set operation ofset
is not as fast as the orderlyvector
. - You can modify
eval_base
to sortvector
before returning the result. But the sorting is repeated every time the user list is filtered. - After parsing the data, the list of users corresponding to all attributes can be pre-sorted. The user list filtered out in the future is naturally ordered.
Modify and submit: Correct 100 2.250s 29.55MB
Another 5 seconds shorter!
#include <bits/stdc++.h> using namespace std; using us_t = vector<int>; // user set using uv_t = pair<int, int>; // (user, attribute value) using db_t = unordered_map<int, vector<uv_t> >; // {property number -> [uv_t]} db_t db; // columnar database us_t eval_base() {<!-- --> // base expression evaluation int aid, x; char op; cin >> aid >> op >> x; us_t us; if(op == ':') {<!-- --> // extract constant expression for(auto & amp; uv : db[aid]) {<!-- --> if(uv.second == x) {<!-- --> // property values are equal us.push_back(uv.first); } } } else {<!-- --> for(auto & amp; uv : db[aid]) {<!-- --> if(uv.second != x) {<!-- --> // attribute values are not equal us.push_back(uv.first); } } } return us; } us_t eval_expr(); // forward declaration us_t eval_nested() {<!-- --> // bracket nested expression char c; cin >> c; assert(c == '('); us_t us = eval_expr(); cin >> c; assert(c == ')'); return us; } us_t eval_expr() {<!-- --> // expression evaluation us_t us; char c; if(!(cin >> c)) {<!-- --> assert(false); // should not EOF } else if(isdigit(c)) {<!-- --> // base expression cin.unget(); // The character c belongs to the expression, fallback us = eval_base(); } else {<!-- --> // logical expression auto a = eval_nested(); auto b = eval_nested(); if(c == ' & amp;') {<!-- --> set_intersection(a.begin(), a.end(), b.begin(), b.end(), inserter(us, us.begin())); } else {<!-- --> assert(c == '|'); set_union (a.begin(), a.end(), b.begin(), b.end(), inserter(us, us.begin())); } } return us; } void print(const us_t & amp; us) {<!-- --> bool first = true; for(auto DN : us) {<!-- --> if(first) {<!-- --> first = false; } else {<!-- --> cout << ' '; } cout << DN; } cout << '\ '; } int main(){<!-- --> ios::sync_with_stdio(0); // See the next step of IO optimization. cin. tie(0); cout. tie(0); int n, m; cin >> n; for(int DN, ak; n > 0 & amp; & amp; cin >> DN >> ak; --n) {<!-- --> // per user for(int aid, x; ak > 0 & amp; & amp; cin >> aid >> x; --ak) {<!-- --> // each attribute db[aid].push_back(make_pair(DN, x)); // Add user and attribute values to the vector corresponding to the attribute. } } auto cmp = [](const uv_t & amp; a, const uv_t & amp; b) {<!-- --> return a.first < b.first; }; for(auto & amp; v : db) {<!-- --> // can only be sorted by user ID after all are read in sort(v. second. begin(), v. second. end(), cmp); } for(cin >> m; m > 0; --m) {<!-- --> print(eval_expr()); } }
v8: IO optimization (100%, 0.64s!)
- The maximum number of users is 2500, and the maximum number of attributes is 500. Each attribute is described by a serial number and a value of 2 numbers. There are about
2500 * 500 * 2 ~ 2.5M
numbers in total. - The maximum number of expressions is 500, and the maximum length is 2000 characters, which is
1M
characters.
So this question should have huge room for IO optimization:
int main() {<!-- --> ios::sync_with_stdio(0); cin. tie(0); cout. tie(0); ...
After breaking the shackles of C: correct 100 0.640s 29.60MB
, satisfied!
Lessons learned
- Mistake 1: Version 1 uses the method of reading the entire line and then parsing it. Version 2 changed to
istringstream
. Later experiments found that it is enough to directly mix and read fromcin
intoint char int
. - Mistake 2: In order to meet the requirement of orderly output, carelessly choose
set
to save data, resulting in significant overhead (close to 5s). - Mistake 3: At the beginning, I held the attitude of demonstration and mixed points, and started coding without analyzing the time complexity. If analyzed carefully, the correct data structure is quite obvious.
- Experience: once again verified the statement “When the appropriate data structure is designed, the algorithm is often straightforward and efficient“.