Article directory: Overview; Hugging Face; obs operation; git-lfs; example; RedPajama-Data-1T; SlimPajama-627B/; git clone resume; Data Format; References. Overview: A corpus is essential to large-model training. Many corpora are currently available for download on the public Internet, but it is impractical for every user and every training task to pull a corpus through the public […]
Tag: corpus
Write corpus text to database 20231104
import java.sql.Connection; import java.sql.DriverManager; import java.sql.PreparedStatement; import java.sql.ResultSet; public class BaseDao { public Connection conn = null; public PreparedStatement ps = null; public ResultSet rs = null; public void getConnection() throws Exception { Class.forName("com.mysql.cj.jdbc.Driver"); conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/languages_material_database?serverTimezone=Asia/Shanghai&useTimezone=true", "root", "123456"); } public ResultSet executeQuery(String sql, Object[] param) throws Exception { this.getConnection(); ps […]
Generating a corpus with self-instruct: hands-on code
Contents: introduction to self-instruct; the self-instruct framework; implementation of the corpus-generation code. Step 1: generate new instructions with the model. Step 2: classify the instructions the model generated. Step 3: produce different outputs depending on the result of Step 2. Step 4: filtering and post-processing. This article analyzes the process of generating corpus […]
Natural Language Processing (1) Brown Corpus
What is natural language processing? Natural language processing is an important direction in computer science and artificial intelligence. It is a discipline that combines linguistics, computer science, and mathematics. Its full English name is Natural Language Processing, usually abbreviated as NLP. In simple terms, […]
[Solved] Fixing the error wikiextractor reports when extracting the Wikipedia corpus
When I first extracted the Wikipedia corpus I used wikiextractor, but it kept reporting an error and turned out to be unusable. Since many people have asked me how to do the extraction, I am publishing the code here. I did not write the code myself; I found it on a website. Because […]