The big data innovation course I have been taking is about to end, and it requires a final project with a defense. Now that I have finished mine, I am sharing it so that anyone who needs it can use it as a reference.
Data crawling
BeautifulSoup part description:
Since the pages are served as static HTML, and out of personal preference, I used BeautifulSoup for page parsing. Friends who are not familiar with BeautifulSoup can read the Beautiful Soup 4.4.0 official documentation at https://beautifulsoup.cn/ to learn more.
XPath part description:
Of course, you can also use XPath to locate the page tags. Personally, I find XPath more convenient: press F12 in the browser to open developer mode and copy the XPath of an element directly, which makes things easier.
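For example, here is a minimal XPath sketch (not the code I used for the assignment; the lxml library is assumed and the path below is only illustrative):

import requests
from lxml import etree

html = requests.get("https://bj.lianjia.com/zufang/chaoyang/pg1",
                    headers={"User-Agent": "Mozilla/5.0"}).text
tree = etree.HTML(html)
# A copied XPath can be pasted here as-is; this particular path is only an illustration
titles = tree.xpath('//div[@class="content__list--item--main"]//p[1]/a/text()')
print([t.strip() for t in titles])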
Crawler code display
Next is the complete code for the data collection crawler:
import requests
from bs4 import BeautifulSoup
import csv
import re

def get_date(url, time=10):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.69"}
    try:
        rel = requests.get(url, timeout=time, headers=headers)
        rel.encoding = rel.apparent_encoding
        rel.raise_for_status()
        return rel.text
    except Exception as err:
        print(err)

def find_date(rel):
    pricelist, arealist, northlist, floorlist = [], [], [], []
    homelist1, homelist2, homelist3 = [], [], []
    result = []
    count = []   # indices of skipped listings, so the price list stays aligned
    soup = BeautifulSoup(rel, "html.parser")
    targets1 = soup.select(".content__list>div>div")
    for i, item in enumerate(targets1):
        try:
            # The description text looks like "小区名 / 52.00㎡ / 南 / 2室1厅1卫"
            desc = item.select("p")[1].text.strip()
            # Number of bedrooms: the digits before "室"
            bedrooms = re.sub(r'[^a-zA-Z0-9\s]', '', desc.split("室")[0].split("/")[-1].strip())
            if not ("0" < bedrooms < "5"):   # keep 1-4 bedrooms only
                raise ValueError("unwanted layout")
            # Area: the number before "㎡"
            area = ''.join(re.findall(r'[a-zA-Z0-9.]', desc.split("㎡")[0].split("/")[-1]))
            direction = desc.split("㎡")[1].split("/")[1].strip()
            halls = desc.split("室")[1].split("厅")[0]    # living rooms
            baths = desc.split("厅")[1].split("卫")[0]    # bathrooms
            floor = item.select("span")[0].text.split("层")[1].split("(")[1]
        except Exception:
            count.append(i)   # parsing failed or layout unwanted: skip its price too
            continue
        homelist1.append(bedrooms)
        arealist.append(area)
        northlist.append(direction)
        homelist2.append(halls)
        homelist3.append(baths)
        floorlist.append(floor)
    # Collect the prices, skipping those of the listings skipped above
    targets = soup.select(".content__list>div>div>span")
    j = 0
    for i, item in enumerate(targets):
        if j < len(count) and i == count[j]:
            j = j + 1
            continue
        pricelist.append(item.text.strip().split("元/月")[0])
    for i in range(len(pricelist)):
        result.append([pricelist[i], arealist[i], northlist[i], floorlist[i],
                       homelist1[i], homelist2[i], homelist3[i]])
    return result

def save_data(path, data):
    with open(path, "a+", newline="", encoding="utf-8-sig") as f:
        csv_writer = csv.writer(f)
        csv_writer.writerows(data)

if __name__ == "__main__":
    for i in range(100):
        url = "https://bj.lianjia.com/zufang/chaoyang/pg" + str(i + 1)
        rel = get_date(url)
        if rel is None:   # skip pages that failed to download
            continue
        result = find_date(rel)
        save_data("Beijing Chaoyang Rental.csv", result)
        print("No." + str(i + 1) + " page crawling completed")
If there is anything that can be improved, I hope you can help point it out (cupped-fist salute)!!!
Crawled content display
There are seven columns of data in total. Of course, this is just the raw collected data and has not been cleaned yet; it will be cleaned further with pandas later.
Data cleaning
The data cleaning and data modeling parts are performed in Jupyter Notebook.
Introduction to Jupyter Notebook
What is Jupyter Notebook
Jupyter Notebook is a web-based interactive computing environment that supports multiple programming languages, including Python, R, Julia, etc. Its main function is to combine code, text, mathematical equations, visualizations and other related elements to create a dynamic document for use in data analysis, machine learning, scientific computing and data visualization. Jupyter Notebook provides an interactive interface that enables users to build and execute code incrementally and visually.
Import data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv(r"C:\Users\86132\Desktop\Beijing Chaoyang Rental.csv")
data
Data Display
Check whether there are missing values in the data
Missing values can be filled in many ways. If the data set is large enough, rows with missing values can also simply be deleted (I was lazy, so I chose deletion; please don't learn from me).
At the same time, here are some common methods for handling missing values (a small code sketch of methods 2 to 4 follows after this list):
(1) Delete
Delete the samples (rows) or features (columns) that contain missing attribute values to obtain a complete data table.
(2) Filling
1. Manual filling
When you know the data set well enough, you can fill in the missing values by hand. However, this is generally time-consuming and becomes infeasible when the data is large and there are many nulls. Generally not recommended.
2. Special value filling
Treat a null value as a special attribute value, distinct from any other value; for example, fill all empty values with "unknown". This is generally used as a temporary fill or an intermediate step. It can sometimes cause serious bias in the data and is generally not recommended.
3. Average filling
The attributes of the initial data set are divided into numeric and non-numeric attributes and handled separately.
If the missing value is numeric, fill it with the mean of that attribute over all other objects; if it is non-numeric, use the statistical mode principle and fill it with the value that occurs most frequently for that attribute among all other objects (that is, the value with the highest frequency).
A similar method is called conditional mean filling. Here, the value used for averaging is not taken from all objects in the data set, but only from the objects that have the same decision attribute value as the object being filled. Both completion methods start from the same idea, supplementing missing attribute values with the most probable values; they differ only in the details. Compared with other methods, they use most of the information in the existing data to infer the missing values.
4. K nearest neighbor method
First, determine the K samples closest to the sample with missing data based on Euclidean distance or correlation analysis, and use the weighted average of these K values to estimate the missing data of the sample.
5. Regression
Build a regression equation on the complete part of the data set, or use a regression algorithm from machine learning. For objects containing null values, substitute the known attribute values into the equation to estimate the unknown ones, and fill with the estimates. The estimates are biased when the variables are not linearly related. This method is fairly commonly used.
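As promised above, here is a small sketch of filling methods 2 to 4 (my own toy example, not the assignment code; KNNImputer is scikit-learn's k-nearest-neighbour imputer):

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

df = pd.DataFrame({"area": [52.0, np.nan, 80.0, 60.0],
                   "floor": [6, 12, 20, 8],
                   "direction": ["南", "北", None, "南"]})

# 2. Special value filling: treat missing values as their own category
df["direction_special"] = df["direction"].fillna("unknown")

# 3. Mean filling for a numeric column, mode filling for a non-numeric one
df["area_mean"] = df["area"].fillna(df["area"].mean())
df["direction_mode"] = df["direction"].fillna(df["direction"].mode()[0])

# 4. K nearest neighbour filling: a missing value is estimated from the
#    values of the k most similar rows (similarity uses the other columns)
imputer = KNNImputer(n_neighbors=2)
df[["area_knn", "floor_knn"]] = imputer.fit_transform(df[["area", "floor"]])

print(df)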
data.isnull().sum()
data.dropna(inplace=True)
Remove duplicates
data.drop_duplicates(inplace=True)
Quantitative processing of data
Since the house orientation is stored as a string, which is very unfriendly for subsequent modeling, the orientation is quantified. First, because some orientations contain spaces in the middle, those are removed:
data['dirction'] = data['dirction'].str.replace(' ', '')
print(data)
The next step is to create a mapping dictionary and then use it to perform the quantified replacement:
direction_mapping = {
    '东': 1,    # East
    '南': 2,    # South
    '西': 3,    # West
    '北': 4,    # North
    '东北': 5,  # Northeast
    '东南': 6,  # Southeast
    '西北': 7,  # Northwest
    '西南': 8,  # Southwest
    '南北': 9   # North and South (dual orientation)
}
# Use the mapping dictionary for quantified replacement
data['dirction'] = data['dirction'].replace(direction_mapping)
print(data)
Among them, properties facing both north and south have a dual orientation, so they are given their own category.
Since the assignment requires more than 1,000 rows of data, all outliers that appeared after quantification were simply deleted (actually because I was lazy and didn't want to deal with them, haha); a sketch of one way to do this follows.
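For reference, here is one way the deletion could be done (a sketch I am reconstructing; my actual code may have differed). After replace(), any orientation string not in the mapping is still a string, so coercing the column to numeric turns the unmapped values into NaN, which can then be dropped:

data['dirction'] = pd.to_numeric(data['dirction'], errors='coerce')  # unmapped strings -> NaN
data.dropna(subset=['dirction'], inplace=True)                       # drop those rows
data['dirction'] = data['dirction'].astype(int)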
Visualization
Histogram
The histogram, also called a frequency distribution chart, is a statistical chart. Based on the distribution of the data, it is drawn as a series of adjacent bars, with the class interval as the base and the frequency as the height.
It is used to display the distribution of the data, such as the mode, the approximate location of the median, and whether there are gaps or outliers in the data.
import matplotlib
matplotlib.rc("font", family='YouYuan')

prices = data['price']
# Draw the histogram
plt.figure(figsize=(8, 6))
plt.hist(prices, bins=20, color='blue', edgecolor='black')  # 20 bins
plt.title('Rent Price Histogram')
plt.xlabel('price')
plt.ylabel('number of houses')
plt.show()
Because matplotlib cannot display Chinese by default, I searched around and found a solution: just add one line to the code found online:
matplotlib.rc("font", family='YouYuan')  # put the font you need in family
Although I didn't know why at the time, it works!!!
In the end, my histogram was successfully displayed.
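For reference, another common fix I have seen (an alternative I did not use; it assumes the SimHei font is installed, as on most Windows systems) sets the font through rcParams:

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # a font that contains Chinese glyphs
plt.rcParams['axes.unicode_minus'] = False    # keep the minus sign displaying correctly

The reason either fix works is that matplotlib's default font simply contains no Chinese glyphs, so switching to any font that has them (YouYuan, SimHei, etc.) solves the problem.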
Correlation matrix
A correlation matrix is a basic tool of data analysis. It allows us to understand how different variables relate to each other.
Correlation
Correlation is a concept in statistics that measures the strength and direction of the linear relationship between two random variables. If one variable increases as the other increases, the two variables are positively correlated; conversely, if one variable decreases as the other increases, they are negatively correlated. Correlation values lie between -1 and 1, where 0 indicates no correlation and 1 or -1 indicates a perfect positive or negative correlation.
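As a quick illustration (my own toy example, not part of the assignment), the Pearson correlation of two arrays can be computed directly with numpy:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# np.corrcoef returns the 2x2 correlation matrix of x and y;
# the off-diagonal entry is the Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]
print(r)  # a value between -1 and 1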
Correlation Matrix
A correlation matrix can help us:
(1) Understand the relationships between variables.
(2) Discover possible multicollinearity problems (multicollinearity means high correlation between independent variables, which can make the coefficient estimates of a linear regression model unstable).
(3) Provide a basis for feature selection (if two variables are highly correlated, we may only need to keep one of them).
We can obtain the correlation coefficient matrix directly through the corr() function provided by the pandas library in Python:
features = data[["area", "dirction", "floor", "bed", "sitting", "bath", "price"]]
correlation_matrix = features.corr()
print(correlation_matrix)
But a plain table of numbers looks ugly, is not fancy enough, and the relationships are hard to see at a glance, so next we draw a heat map from the correlation coefficient matrix:
Heat map
import seaborn as sns

plt.figure(figsize=(8, 6))
matplotlib.rc("font", family='YouYuan')
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Feature correlation coefficient matrix heat map')
plt.show()
In this way, the correlation between them can be clearly seen at a glance, and it is also more beautiful.
Machine learning model
Univariate linear regression
Univariate (simple) linear regression analyzes the linear relationship between a single independent variable x and the dependent variable y. The value of an economic indicator is often affected by many factors; if one factor is dominant and plays a decisive role, linear regression can be used for predictive analysis.
We first model the relationship between area and price:
x = data.loc[:, "area"]
y = data.loc[:, "price"]
Import the LinearRegression class:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
Transform data dimensions
x = np.array(x)
x = x.reshape(-1, 1)
y = np.array(y)
y = y.reshape(-1, 1)
print(type(x), x.shape, type(y), y.shape)
Train the model and obtain the predicted values, coefficient, and intercept of the univariate linear model:
lr_model.fit(x, y)
y_predict = lr_model.predict(x)
print(y_predict)
a = lr_model.coef_
b = lr_model.intercept_
print(a, b)
Use the R2 score to see how accurate the model is:
from sklearn.metrics import mean_squared_error, r2_score

MSE = mean_squared_error(y, y_predict)
R2 = r2_score(y, y_predict)
print(MSE, R2)
But the model accuracy is not ideal: the R2 score is only 0.66.
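To see where the fit falls short, a quick scatter plot of the data against the fitted line helps (a small extra sketch reusing the x, y and y_predict computed above; it was not part of my original write-up):

plt.figure(figsize=(8, 6))
plt.scatter(x, y, s=5, alpha=0.5, label='data')           # raw points: area vs price
plt.plot(x, y_predict, color='red', label='fitted line')  # the univariate regression line
plt.xlabel('area')
plt.ylabel('price')
plt.legend()
plt.show()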
Multiple linear regression
So I further chose a multiple linear regression model:
X_multi = data.drop(['price'], axis=1)
X_multi
The code is almost the same as for univariate linear regression, except that there are more x features:
LR_multi = LinearRegression()
LR_multi.fit(X_multi, y)
Get the predicted values:
y_predict_multi = LR_multi.predict(X_multi)
print(y_predict_multi)
Evaluate the model again:
mean_squared_error_multi = mean_squared_error(y, y_predict_multi)
r2_score_multi = r2_score(y, y_predict_multi)
print(mean_squared_error_multi, r2_score_multi)
But the result is still not great: the R2 score only improved from 0.66 to 0.69. I began to wonder whether this data set is simply not suited to linear regression; but the assignment requires linear regression, so is something wrong with the data set instead? To check whether linear regression is reasonable here, I switched to a different machine learning model, the random forest algorithm, remodeled the data, and compared the results of the two models.
Random Forest Regression
The idea of random forest
Random forest belongs to ensemble learning. Its core idea is to combine many weak learners to achieve the effect described by the Chinese proverb "three cobblers with their wits combined equal Zhuge Liang". Random forest adopts the idea of bagging, which works as follows:
(1) Take n training samples from the training set with replacement each time to form a new training set;
(2) Use the new training set to train M sub-models;
(3) For classification problems, voting is used: the category that receives the most votes from the sub-models is the final category; for regression problems, the sub-models' predictions are simply averaged to obtain the final predicted value.
Multiple decision trees are combined: each tree's data set is randomly sampled with replacement, and a random subset of the features is used as input, which is why the algorithm is called random forest. So random forest is a bagging algorithm with decision trees as the base estimators.
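To make the "bagging of decision trees" point concrete, here is a minimal sketch (my own illustration, not the assignment code) that builds that kind of ensemble explicitly with scikit-learn's BaggingRegressor; note that a true random forest additionally randomizes the features considered at each split inside every tree:

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

# Bagging with decision trees as base estimators: each of the 100 trees is
# trained on a bootstrap sample (drawn with replacement), and the final
# prediction is the average of the trees' predictions
bagged_trees = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # parameter name in scikit-learn >= 1.2
    n_estimators=100,
    bootstrap=True,
    random_state=42,
)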
Next, here is the code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Extract features and label
X = data[["area", "dirction", "floor", "bed", "sitting", "bath"]]  # features
y = data['price']  # label

# Split into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a random forest regression model
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_regressor.fit(X_train, y_train)

# Use the model to make predictions
y_pred = rf_regressor.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)
print("Coefficient of determination (R^2):", r2)
Conclusion
The final R2 score obtained through random forest regression is still only 0.67, so I think the problem lies with the data set rather than the model. I will not change it further for now; once I work out concrete improvements, I will share them with you all.
The above is the entire process of my big assignment. I hope it is helpful to you all.