The big data innovation course I have been taking is about to end, and it requires a final project with a defense. Now that I have finished mine, I am sharing it so that anyone who needs it can use it as a reference.
Data crawling
BeautifulSoup part description:
Since the pages are served as static HTML, and out of personal preference, I used BeautifulSoup for page parsing. Friends who are not familiar with BeautifulSoup can read the Beautiful Soup 4.4.0 official documentation at https://beautifulsoup.cn/ to learn more.
XPath part description:
Of course, you can also use XPath to locate the page tags. Personally, I find XPath more convenient: press F12 in the browser to open developer mode and copy the XPath of an element directly, which makes things easier.
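For example, here is a minimal XPath sketch (not the code I used for the assignment; the lxml library is assumed and the path below is only illustrative):

import requests
from lxml import etree

html = requests.get("https://bj.lianjia.com/zufang/chaoyang/pg1",
                    headers={"User-Agent": "Mozilla/5.0"}).text
tree = etree.HTML(html)
# A copied XPath can be pasted here as-is; this particular path is only an illustration
titles = tree.xpath('//div[@class="content__list--item--main"]//p[1]/a/text()')
print([t.strip() for t in titles])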
Crawler code display
Next is the complete code for the data collection crawler:
import requests
from bs4 import BeautifulSoup
import csv
import re

def get_date(url, time=10):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.69"}
    try:
        rel = requests.get(url, timeout=time, headers=headers)
        rel.encoding = rel.apparent_encoding
        rel.raise_for_status()
        return rel.text
    except Exception as err:
        print(err)

def find_date(rel):
    pricelist, arealist, northlist, floorlist = [], [], [], []
    homelist1, homelist2, homelist3 = [], [], []
    result = []
    count = []   # indices of skipped listings, so the price list stays aligned
    soup = BeautifulSoup(rel, "html.parser")
    targets1 = soup.select(".content__list>div>div")
    for i, item in enumerate(targets1):
        try:
            # The description text looks like "小区名 / 52.00㎡ / 南 / 2室1厅1卫"
            desc = item.select("p")[1].text.strip()
            # Number of bedrooms: the digits before "室"
            bedrooms = re.sub(r'[^a-zA-Z0-9\s]', '', desc.split("室")[0].split("/")[-1].strip())
            if not ("0" < bedrooms < "5"):   # keep 1-4 bedrooms only
                raise ValueError("unwanted layout")
            # Area: the number before "㎡"
            area = ''.join(re.findall(r'[a-zA-Z0-9.]', desc.split("㎡")[0].split("/")[-1]))
            direction = desc.split("㎡")[1].split("/")[1].strip()
            halls = desc.split("室")[1].split("厅")[0]    # living rooms
            baths = desc.split("厅")[1].split("卫")[0]    # bathrooms
            floor = item.select("span")[0].text.split("层")[1].split("(")[1]
        except Exception:
            count.append(i)   # parsing failed or layout unwanted: skip its price too
            continue
        homelist1.append(bedrooms)
        arealist.append(area)
        northlist.append(direction)
        homelist2.append(halls)
        homelist3.append(baths)
        floorlist.append(floor)
    # Collect the prices, skipping those of the listings skipped above
    targets = soup.select(".content__list>div>div>span")
    j = 0
    for i, item in enumerate(targets):
        if j < len(count) and i == count[j]:
            j = j + 1
            continue
        pricelist.append(item.text.strip().split("元/月")[0])
    for i in range(len(pricelist)):
        result.append([pricelist[i], arealist[i], northlist[i], floorlist[i],
                       homelist1[i], homelist2[i], homelist3[i]])
    return result

def save_data(path, data):
    with open(path, "a+", newline="", encoding="utf-8-sig") as f:
        csv_writer = csv.writer(f)
        csv_writer.writerows(data)

if __name__ == "__main__":
    for i in range(100):
        url = "https://bj.lianjia.com/zufang/chaoyang/pg" + str(i + 1)
        rel = get_date(url)
        if rel is None:   # skip pages that failed to download
            continue
        result = find_date(rel)
        save_data("Beijing Chaoyang Rental.csv", result)
        print("No." + str(i + 1) + " page crawling completed")
If there is anything that can be improved, I hope you can help point it out (cupped-fist salute)!!!
Crawled content display
There are seven columns of data in total. Of course, this is just the raw collected data and has not been cleaned yet; it will be cleaned further with pandas later.
Data cleaning
The data cleaning and data modeling parts are performed in Jupyter Notebook.
Introduction to Jupyter Notebook
What is Jupyter Notebook
Jupyter Notebook is a web-based interactive computing environment that supports multiple programming languages, including Python, R, Julia, etc. Its main function is to combine code, text, mathematical equations, visualizations and other related elements to create a dynamic document for use in data analysis, machine learning, scientific computing and data visualization. Jupyter Notebook provides an interactive interface that enables users to build and execute code incrementally and visually.
Import data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv(r"C:\Users\86132\Desktop\Beijing Chaoyang Rental.csv")
data
Data Display
Check whether there are missing values in the data
Missing values can be filled in many ways. If the data set is large enough, rows with missing values can also simply be deleted (I was lazy, so I chose deletion; please don't learn from me).
At the same time, here are some common methods for handling missing values (a small code sketch of methods 2 to 4 follows after this list):
(1) Delete
Delete the samples (rows) or features (columns) that contain missing attribute values to obtain a complete data table.
(2) Filling
1. Manual filling
When you know the data set well enough, you can fill in the missing values by hand. However, this is generally time-consuming and becomes infeasible when the data is large and there are many nulls. Generally not recommended.
2. Special value filling
Treat a null value as a special attribute value, distinct from any other value; for example, fill all empty values with "unknown". This is generally used as a temporary fill or an intermediate step. It can sometimes cause serious bias in the data and is generally not recommended.
3. Average filling
The attributes of the initial data set are divided into numeric and non-numeric attributes and handled separately.
If the missing value is numeric, fill it with the mean of that attribute over all other objects; if it is non-numeric, use the statistical mode principle and fill it with the value that occurs most frequently for that attribute among all other objects (that is, the value with the highest frequency).
A similar method is called conditional mean filling. Here, the value used for averaging is not taken from all objects in the data set, but only from the objects that have the same decision attribute value as the object being filled. Both completion methods start from the same idea, supplementing missing attribute values with the most probable values; they differ only in the details. Compared with other methods, they use most of the information in the existing data to infer the missing values.
4. K nearest neighbor method
First, determine the K samples closest to the sample with missing data based on Euclidean distance or correlation analysis, and use the weighted average of these K values to estimate the missing data of the sample.
5. Regression
Build a regression equation on the complete part of the data set, or use a regression algorithm from machine learning. For objects containing null values, substitute the known attribute values into the equation to estimate the unknown ones, and fill with the estimates. The estimates are biased when the variables are not linearly related. This method is fairly commonly used.
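As promised above, here is a small sketch of filling methods 2 to 4 (my own toy example, not the assignment code; KNNImputer is scikit-learn's k-nearest-neighbour imputer):

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

df = pd.DataFrame({"area": [52.0, np.nan, 80.0, 60.0],
                   "floor": [6, 12, 20, 8],
                   "direction": ["南", "北", None, "南"]})

# 2. Special value filling: treat missing values as their own category
df["direction_special"] = df["direction"].fillna("unknown")

# 3. Mean filling for a numeric column, mode filling for a non-numeric one
df["area_mean"] = df["area"].fillna(df["area"].mean())
df["direction_mode"] = df["direction"].fillna(df["direction"].mode()[0])

# 4. K nearest neighbour filling: a missing value is estimated from the
#    values of the k most similar rows (similarity uses the other columns)
imputer = KNNImputer(n_neighbors=2)
df[["area_knn", "floor_knn"]] = imputer.fit_transform(df[["area", "floor"]])

print(df)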
data.isnull().sum()
data.dropna(inplace=True)
Remove duplicates
data.drop_duplicates(inplace=True)
Quantitative processing of data
Since the house orientation is stored as a string, which is very unfriendly for subsequent modeling, the orientation is quantified. First, because some orientations contain spaces in the middle, those are removed:
data['dirction'] = data['dirction'].str.replace(' ', '')
print(data)
The next step is to create a mapping dictionary and then use it to perform the quantified replacement:
direction_mapping = {
    '东': 1,    # East
    '南': 2,    # South
    '西': 3,    # West
    '北': 4,    # North
    '东北': 5,  # Northeast
    '东南': 6,  # Southeast
    '西北': 7,  # Northwest
    '西南': 8,  # Southwest
    '南北': 9   # North and South (dual orientation)
}
# Use the mapping dictionary for quantified replacement
data['dirction'] = data['dirction'].replace(direction_mapping)
print(data)
Among them, properties facing both north and south have a dual orientation, so they are given their own category.
Since the assignment requires more than 1,000 rows of data, all outliers that appeared after quantification were simply deleted (actually because I was lazy and didn't want to deal with them, haha); a sketch of one way to do this follows.
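For reference, here is one way the deletion could be done (a sketch I am reconstructing; my actual code may have differed). After replace(), any orientation string not in the mapping is still a string, so coercing the column to numeric turns the unmapped values into NaN, which can then be dropped:

data['dirction'] = pd.to_numeric(data['dirction'], errors='coerce')  # unmapped strings -> NaN
data.dropna(subset=['dirction'], inplace=True)                       # drop those rows
data['dirction'] = data['dirction'].astype(int)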
Visualization
Histogram
The histogram, also called a frequency distribution chart, is a statistical chart. Based on the distribution of the data, it is drawn as a series of adjacent bars, with the class interval as the base and the frequency as the height.
It is used to display the distribution of the data, such as the mode, the approximate location of the median, and whether there are gaps or outliers in the data.
import matplotlib
matplotlib.rc("font", family='YouYuan')

prices = data['price']
# Draw the histogram
plt.figure(figsize=(8, 6))
plt.hist(prices, bins=20, color='blue', edgecolor='black')  # 20 bins
plt.title('Rent Price Histogram')
plt.xlabel('price')
plt.ylabel('number of houses')
plt.show()
Because matplotlib cannot display Chinese by default, I searched around and found a solution: just add one line to the code found online:
matplotlib.rc("font", family='YouYuan')  # put the font you need in family
Although I didn't know why at the time, it works!!!
In the end, my histogram was successfully displayed.
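For reference, another common fix I have seen (an alternative I did not use; it assumes the SimHei font is installed, as on most Windows systems) sets the font through rcParams:

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # a font that contains Chinese glyphs
plt.rcParams['axes.unicode_minus'] = False    # keep the minus sign displaying correctly

The reason either fix works is that matplotlib's default font simply contains no Chinese glyphs, so switching to any font that has them (YouYuan, SimHei, etc.) solves the problem.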
Correlation matrix
A correlation matrix is a basic tool of data analysis. It allows us to understand how different variables relate to each other.
Correlation
Correlation is a concept in statistics that measures the strength and direction of the linear relationship between two random variables. If one variable increases as the other increases, the two variables are positively correlated; conversely, if one variable decreases as the other increases, they are negatively correlated. Correlation values lie between -1 and 1, where 0 indicates no correlation and 1 or -1 indicates a perfect positive or negative correlation.
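As a quick illustration (my own toy example, not part of the assignment), the Pearson correlation of two arrays can be computed directly with numpy:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# np.corrcoef returns the 2x2 correlation matrix of x and y;
# the off-diagonal entry is the Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]
print(r)  # a value between -1 and 1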
Correlation Matrix
A correlation matrix can help us:
(1) Understand the relationships between variables.
(2) Discover possible multicollinearity problems (multicollinearity means high correlation between independent variables, which can make the coefficient estimates of a linear regression model unstable).
(3) Provide a basis for feature selection (if two variables are highly correlated, we may only need to keep one of them).
We can obtain the correlation coefficient matrix directly through the corr() function provided by the pandas library in Python:
features = data[["area", "dirction", "floor", "bed", "sitting", "bath", "price"]]
correlation_matrix = features.corr()
print(correlation_matrix)
But a plain table of numbers looks ugly, is not fancy enough, and the relationships are hard to see at a glance, so next we draw a heat map from the correlation coefficient matrix:
Heat map
import seaborn as sns

plt.figure(figsize=(8, 6))
matplotlib.rc("font", family='YouYuan')
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Feature correlation coefficient matrix heat map')
plt.show()
In this way, the correlation between them can be clearly seen at a glance, and it is also more beautiful.
Machine learning model
Univariate linear regression
Univariate (simple) linear regression analyzes the linear relationship between a single independent variable x and the dependent variable y. The value of an economic indicator is often affected by many factors; if one factor is dominant and plays a decisive role, linear regression can be used for predictive analysis.
We first model the relationship between area and price:
x = data.loc[:, "area"]
y = data.loc[:, "price"]
Import the LinearRegression class:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
Transform data dimensions
x = np.array(x)
x = x.reshape(-1, 1)
y = np.array(y)
y = y.reshape(-1, 1)
print(type(x), x.shape, type(y), y.shape)
Train the model and obtain the predicted values, coefficient, and intercept of the univariate linear model:
lr_model.fit(x, y)
y_predict = lr_model.predict(x)
print(y_predict)
a = lr_model.coef_
b = lr_model.intercept_
print(a, b)
Use the R2 score to see how accurate the model is:
from sklearn.metrics import mean_squared_error, r2_score

MSE = mean_squared_error(y, y_predict)
R2 = r2_score(y, y_predict)
print(MSE, R2)
But the model accuracy is not ideal: the R2 score is only 0.66.
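To see where the fit falls short, a quick scatter plot of the data against the fitted line helps (a small extra sketch reusing the x, y and y_predict computed above; it was not part of my original write-up):

plt.figure(figsize=(8, 6))
plt.scatter(x, y, s=5, alpha=0.5, label='data')           # raw points: area vs price
plt.plot(x, y_predict, color='red', label='fitted line')  # the univariate regression line
plt.xlabel('area')
plt.ylabel('price')
plt.legend()
plt.show()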
Multiple linear regression
So I further chose a multiple linear regression model:
X_multi = data.drop(['price'], axis=1)
X_multi
The code is almost the same as for univariate linear regression, except that there are more x features:
LR_multi = LinearRegression()
LR_multi.fit(X_multi, y)
Get the predicted values:
y_predict_multi = LR_multi.predict(X_multi)
print(y_predict_multi)
Evaluate the model again:
mean_squared_error_multi = mean_squared_error(y, y_predict_multi)
r2_score_multi = r2_score(y, y_predict_multi)
print(mean_squared_error_multi, r2_score_multi)
But the result is still not great: the R2 score only improved from 0.66 to 0.69. I began to wonder whether this data set is simply not suited to linear regression; but the assignment requires linear regression, so is something wrong with the data set instead? To check whether linear regression is reasonable here, I switched to a different machine learning model, the random forest algorithm, remodeled the data, and compared the results of the two models.
Random Forest Regression
The idea of random forest
Random forest belongs to ensemble learning. Its core idea is to combine many weak learners to achieve the effect described by the Chinese proverb "three cobblers with their wits combined equal Zhuge Liang". Random forest adopts the idea of bagging, which works as follows:
(1) Take n training samples from the training set with replacement each time to form a new training set;
(2) Use the new training set to train M sub-models;
(3) For classification problems, voting is used: the category that receives the most votes from the sub-models is the final category; for regression problems, the sub-models' predictions are simply averaged to obtain the final predicted value.
Multiple decision trees are combined: each tree's data set is randomly sampled with replacement, and a random subset of the features is used as input, which is why the algorithm is called random forest. So random forest is a bagging algorithm with decision trees as the base estimators.
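To make the "bagging of decision trees" point concrete, here is a minimal sketch (my own illustration, not the assignment code) that builds that kind of ensemble explicitly with scikit-learn's BaggingRegressor; note that a true random forest additionally randomizes the features considered at each split inside every tree:

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

# Bagging with decision trees as base estimators: each of the 100 trees is
# trained on a bootstrap sample (drawn with replacement), and the final
# prediction is the average of the trees' predictions
bagged_trees = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # parameter name in scikit-learn >= 1.2
    n_estimators=100,
    bootstrap=True,
    random_state=42,
)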
Next, here is the code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Extract features and label
X = data[["area", "dirction", "floor", "bed", "sitting", "bath"]]  # features
y = data['price']  # label

# Split into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a random forest regression model
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_regressor.fit(X_train, y_train)

# Use the model to make predictions
y_pred = rf_regressor.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)
print("Coefficient of determination (R^2):", r2)
Conclusion
The final R2 score obtained through random forest regression is still only 0.67, so I think the problem lies with the data set rather than the model. I will not change it further for now; once I work out concrete improvements, I will share them with you all.
The above is the entire process of my big assignment. I hope it is helpful to you all.