CSP-202303-3 LDAP – SyntaxBug

The 4th and 5th question types of CSP certification are like climbing Mount Everest, which is desperate. The third question type is like climbing Mount Tai, knowing that it can be solved, but it will always take hours of hard work. At the request of the students, try to do the third question of the CSP certification in March 2023: LDAP. For the questions, please refer to the mock exam on the CSP official website.
The pattern of the third question type is: parse data, build model, perform operation/answer query/transform output. The computational portion of this LDAP problem is Responding to queries.
The input of other question types is mostly numbers, which is relatively simple. The input of the third question type is usually text, and it is necessary to use the lexical analysis and grammatical analysis of the compilation principle to analyze the data.

v1: Basic expression (20%)

For 20% of the input… are atomic expressions… BASE_EXPR = ATTRIBUTE OPERATOR VALUE

Realize the simple situation first, hit a rabbit with a waistband:

Data structure: two-level mapping map>.
Algorithm: Brute force search for all users that has the attribute and the value meets the requirement (equal, unequal).
The user (DN), attribute number (aid), and attribute value (value) do not exceed 1e9, and do not involve arithmetic calculations, so it can be represented by int.
The basic expression can be read directly by using cin >> aid >> op >> val. Among them, char op; may be : or ~, which means equal or equal. // Error 1
The output user list needs to be sorted from small to large, so use set to save. // Mistake 2

Received 20 minutes soon after submitting: running error 20 2.015s 58.58MB. Runtime errors are also as expected, as assertions are set for unimplemented feature branches.

#include <bits/stdc++.h>
using namespace std;

using db_t = map<int, map<int, int> >; // u: user, a: attribute
db_t db; // database, {uid: {attrib_id : value} }

set<int> eval_base() {<!-- --> // base expression evaluation
  int aid, x;
  char op;
  cin >> aid >> op >> x;

  set<int> usr;
  for(auto & amp; m : db) {<!-- -->
    const int DN = m.first;
    for(auto & amp; a : m.second) {<!-- -->
      if(a.first == aid) {<!-- --> // target attribute exists
        if((op == ':' & amp; & amp; a.second == x) ||
           (op == '~' & amp; & amp; a.second != x)) {<!-- -->
          usr.insert(DN);
        }
        break; // This user does not need to check other properties
      }
    }
  }
  return usr;
}

set<int> eval_expr() {<!-- --> // expression evaluation
  set<int> usr;
  char c;
  if(!(cin >> c)) {<!-- -->
    //EOF
  } else if(isdigit(c)) {<!-- --> // base expr
    cin.unget(); // fallback character c
    usr = eval_base();
  } else {<!-- -->
    assert(false); // not implemented
  }
  return usr;
}

void print(const set<int> & amp; usr) {<!-- -->
  bool first = true;
  for(auto DN : usr) {<!-- -->
    if(first) {<!-- -->
      first = false;
    } else {<!-- -->
      cout << ' ';
    }
    cout << DN;
  }
  cout << '\
';
}

int main(){<!-- -->
  int n, m;
  cin >> n;
  for(int DN, ak; n > 0 & amp; & amp; cin >> DN >> ak; --n) {<!-- -->
    map<int, int> & attribs = db[DN];
    for(int aid, x; ak > 0 & amp; & amp; cin >> aid >> x; --ak) {<!-- -->
      attribs[aid] = x;
    }
  }
  cin >> m >> ws;
  for(int i = 0; i < m; + + i) {<!-- -->
    print(eval_expr());
  }
}

v2: Logical expressions (40%)

For 40% of the input… the expression contains a logical combination of at most two atomic expressions, which conforms to the BNF grammar EASY_EXPR.

Ordinary expressions must start with numbers, while logical expressions start with & amp; or |.
When parsing, read a character ahead to determine the expression type. If it is a number, it will fall back to cin.unget(), and then parse it according to a simple expression.
The evaluation result of the basic expression is set, intersection and union operations can be realized by set_intersection and set_union.

It took a few more minutes to write the submission: running error 40 2.078s 58.55MB, still in line with expectations, because multi-level nested expressions have not been implemented yet. Everything is under control.

#include <bits/stdc++.h>
using namespace std;

using db_t = map<int, map<int, int> >; // u: user, a: attribute
using us_t = set<int>; // user list
db_t db; // database, {uid: {attrib_id : value} }

us_t eval_base() {<!-- --> // base expression
  int aid, x;
  char op;
  cin >> aid >> op >> x;

  us_t us;
  for(auto & amp; m : db) {<!-- -->
    const int DN = m.first;
    for(auto & amp; a : m.second) {<!-- -->
      if(a.first == aid) {<!-- --> // target attribute exists
        if((op == ':' & amp; & amp; a.second == x) ||
           (op == '~' & amp; & amp; a.second != x)) {<!-- -->
          us.insert(DN);
        }
        break; // This user does not need to check other properties
      }
    }
  }
  return us;
}

us_t eval_nested() {<!-- --> // bracket nested expression
  char c;
  cin >> c; assert(c == '(');
  us_t us = eval_base();
  cin >> c; assert(c == ')');
  return us;
}

us_t eval_expr() {<!-- --> // expression evaluation
  us_t us;
  char c;
  if(!(cin >> c)) {<!-- -->
    //EOF
  } else if(isdigit(c)) {<!-- --> // base expr
    cin.unget(); // fallback character c
    us = eval_base();
  } else {<!-- -->
    const char logic = c;
    auto a = eval_nested();
    auto b = eval_nested();
    if(logic == ' & amp;') {<!-- -->
      set_intersection(a.begin(), a.end(), b.begin(), b.end(), inserter(us, us.begin()));
    } else {<!-- --> assert(logic == '|');
      set_union (a.begin(), a.end(), b.begin(), b.end(), inserter(us, us.begin()));
    }
  }
  return us;
}

void print(const us_t & amp; us) {<!-- -->
  bool first = true;
  for(auto DN : us) {<!-- -->
    if(first) {<!-- -->
      first = false;
    } else {<!-- -->
      cout << ' ';
    }
    cout << DN;
  }
  cout << '\
';
}

int main(){<!-- -->
  int n, m;
  cin >> n;
  for(int DN, ak; n > 0 & amp; & amp; cin >> DN >> ak; --n) {<!-- -->
    map<int, int> & attribs = db[DN];
    for(int aid, x; ak > 0 & amp; & amp; cin >> aid >> x; --ak) {<!-- -->
      attribs[aid] = x;
    }
  }
  cin >> m >> ws;
  for(int i = 0; i < m; + + i) {<!-- -->
    print(eval_expr());
  }
}

v3: Nested expressions (40%)

At this time, I feel that victory is in sight, because on the basis of v2 only a little effort can support all grammars:

us_t eval_expr(); // forward declaration

us_t eval_nested() {<!-- --> // bracket nested expression
  char c;
  cin >> c; assert(c == '(');
  us_t us = eval_expr(); // was eval_base();
  cin >> c; assert(c == ')');
  return us;
}

Add forward declaration and recursively call eval_expr to realize all grammatical functions. Commit with high hopes: run timed out 40 run timed out 58.55MB

v4: Futile optimization… (70%)

Time Limit Exceeded (TLE)… Looking back at the topic Time Limit: 12.0s, I feel a little worried. The next several hours tried various options:

Replace the attribute table map with unordered_map
- Slightly improved
“The number of attributes for each user does not exceed 500, and the attribute number does not exceed 1e9”, discretization compression, mapping attribute IDs to continuous IDs?
- no change
After the attribute ID is discretized, the attribute table can be replaced with a fixed-length vector vector(505), and the attribute ID is used as the index.
- Runtime error (RTE)… presumably array out of bounds. Could it be that I misunderstood the meaning of the question, there are 500 attributes for each person, but there are more different attributes for all users?
Under the test code test, if the discretization attribute is more than 500, the wrong result will be output and the wrong answer (WA) will be triggered:
- Indeed changed from RTE to WA. I tried several times with 2000, etc., and the memory limit has been exceeded, and all attributes are far more than 500…
```
 if(virtual_aid >= 505) { // The number of different attributes of each user is far more than 505, which indeed triggers an error
      cout << "OVERFLOW" << endl;
      return 0;
    }
```

v5: Data structure (100%, 7.31s)

I have used the data warehouse in the company before, and I have also been exposed to the research and development of the data engine Kudu. There are two types of memory databases:

Online Transaction Processing (OLTP): pursue efficient addition, deletion, modification and query. Data is usually organized in rows (records).
Online Analytical Processing (OLAP): The pursuit of efficient query performance. In addition to technologies such as partitioning and MPP, column storage is also commonly used.

The expression evaluation of this question is all data query without modification, similar to OLAP, which means that columnar storage should be used?

Data structure: map >, when reading in, user ID and attribute value are collected and saved by attribute.
Algorithm: According to the attribute ID, all users with the attribute are directly obtained, and then filtered by the attribute value, without scanning all users.
Modify and submit: correct 100 7.312s 31.54MB, passed! This is the most critical step of optimization.

v6: hash table (100%, 7.29s)

In the previous row-based data organization, using the hash table to save the attribute memory exceeded the limit. Is it faster to replace map with unordered_map after switching to columnar storage?

Modify and submit: Correct 100 7.296s 29.55MB The optimization effect is weak.

v7: presort + vector (100%, 2.25s!)

Also found optimization points:

Both intersection and union keep the results in order. If the results of the underlying expressions are ordered, then the final results are also ordered.
So there is no need to use set. The set operation of set is not as fast as the orderly vector.
You can modify eval_base to sort vector before returning the result. But the sorting is repeated every time the user list is filtered.
After parsing the data, the list of users corresponding to all attributes can be pre-sorted. The user list filtered out in the future is naturally ordered.

Modify and submit: Correct 100 2.250s 29.55MB Another 5 seconds shorter!

#include <bits/stdc++.h>
using namespace std;

using us_t = vector<int>; // user set
using uv_t = pair<int, int>; // (user, attribute value)
using db_t = unordered_map<int, vector<uv_t> >; // {property number -> [uv_t]}
db_t db; // columnar database

us_t eval_base() {<!-- --> // base expression evaluation
  int aid, x;
  char op;
  cin >> aid >> op >> x;
  us_t us;
  if(op == ':') {<!-- --> // extract constant expression
    for(auto & amp; uv : db[aid]) {<!-- -->
      if(uv.second == x) {<!-- --> // property values are equal
        us.push_back(uv.first);
      }
    }
  } else {<!-- -->
    for(auto & amp; uv : db[aid]) {<!-- -->
      if(uv.second != x) {<!-- --> // attribute values are not equal
        us.push_back(uv.first);
      }
    }
  }
  return us;
}

us_t eval_expr(); // forward declaration

us_t eval_nested() {<!-- --> // bracket nested expression
  char c;
  cin >> c; assert(c == '(');
  us_t us = eval_expr();
  cin >> c; assert(c == ')');
  return us;
}

us_t eval_expr() {<!-- --> // expression evaluation
  us_t us;
  char c;
  if(!(cin >> c)) {<!-- -->
    assert(false); // should not EOF
  } else if(isdigit(c)) {<!-- --> // base expression
    cin.unget(); // The character c belongs to the expression, fallback
    us = eval_base();
  } else {<!-- --> // logical expression
    auto a = eval_nested();
    auto b = eval_nested();
    if(c == ' & amp;') {<!-- -->
      set_intersection(a.begin(), a.end(), b.begin(), b.end(), inserter(us, us.begin()));
    } else {<!-- --> assert(c == '|');
      set_union (a.begin(), a.end(), b.begin(), b.end(), inserter(us, us.begin()));
    }
  }
  return us;
}

void print(const us_t & amp; us) {<!-- -->
  bool first = true;
  for(auto DN : us) {<!-- -->
    if(first) {<!-- -->
      first = false;
    } else {<!-- -->
      cout << ' ';
    }
    cout << DN;
  }
  cout << '\
';
}

int main(){<!-- -->
  ios::sync_with_stdio(0); // See the next step of IO optimization.
  cin. tie(0);
  cout. tie(0);
  int n, m;
  cin >> n;
  for(int DN, ak; n > 0 & amp; & amp; cin >> DN >> ak; --n) {<!-- --> // per user
    for(int aid, x; ak > 0 & amp; & amp; cin >> aid >> x; --ak) {<!-- --> // each attribute
      db[aid].push_back(make_pair(DN, x)); // Add user and attribute values to the vector corresponding to the attribute.
    }
  }
  auto cmp = [](const uv_t & amp; a, const uv_t & amp; b) {<!-- --> return a.first < b.first; };
  for(auto & amp; v : db) {<!-- --> // can only be sorted by user ID after all are read in
    sort(v. second. begin(), v. second. end(), cmp);
  }
  for(cin >> m; m > 0; --m) {<!-- -->
    print(eval_expr());
  }
}

v8: IO optimization (100%, 0.64s!)

The maximum number of users is 2500, and the maximum number of attributes is 500. Each attribute is described by a serial number and a value of 2 numbers. There are about 2500 * 500 * 2 ~ 2.5M numbers in total.
The maximum number of expressions is 500, and the maximum length is 2000 characters, which is 1M characters.

So this question should have huge room for IO optimization:

int main() {<!-- -->
  ios::sync_with_stdio(0);
  cin. tie(0);
  cout. tie(0);
...

After breaking the shackles of C: correct 100 0.640s 29.60MB, satisfied!

Lessons learned

Mistake 1: Version 1 uses the method of reading the entire line and then parsing it. Version 2 changed to istringstream. Later experiments found that it is enough to directly mix and read from cin into int char int.
Mistake 2: In order to meet the requirement of orderly output, carelessly choose set to save data, resulting in significant overhead (close to 5s).
Mistake 3: At the beginning, I held the attitude of demonstration and mixed points, and started coding without analyzing the time complexity. If analyzed carefully, the correct data structure is quite obvious.
Experience: once again verified the statement “When the appropriate data structure is designed, the algorithm is often straightforward and efficient“.