Pandas apply
Pandas is a very useful library for data processing in Python, and it provides many data manipulation methods. Many algorithm libraries also accept pandas data structures as input.
Pandas can parse many kinds of data, including CSV, MS Excel, JSON, and HTML. This article lists the reader and writer functions, then looks at one of the most useful methods in pandas: apply.
Supported formats
Pandas can load many formats. The table below lists each one, with the function that reads it into a DataFrame and the function that writes it back out.
Type | Format | Reader | Writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
text | JSON | read_json | to_json |
text | HTML | read_html | to_html |
text | Local clipboard | read_clipboard | to_clipboard |
binary | MS Excel | read_excel | to_excel |
binary | HDF5 Format | read_hdf | to_hdf |
binary | Feather Format | read_feather | to_feather |
binary | Parquet Format | read_parquet | to_parquet |
binary | Msgpack (removed in pandas 1.0) | read_msgpack | to_msgpack |
binary | Stata | read_stata | to_stata |
binary | SAS | read_sas | |
binary | Python Pickle Format | read_pickle | to_pickle |
SQL | SQL | read_sql | to_sql |
SQL | Google Big Query | read_gbq | to_gbq |
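The reader and writer pairs in the table can be tried without any files on disk. The sketch below is only an illustration: the CSV text and the column names are made up, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Each writer mirrors a reader: to_json pairs with read_json, and so on.
json_text = df.to_json(orient="records")

# Reading the JSON back gives a frame with the same columns and values.
df_back = pd.read_json(io.StringIO(json_text), orient="records")
```

The other readers in the table follow the same pattern, although some of them need an extra package installed.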
Apply function
Once the data is loaded, pandas offers many functions for processing it, but I think the most useful one is apply.
In pandas 1.0 and later, its core signature is:
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
Older versions also accepted the broadcast and reduce keywords; both were removed in pandas 1.0 in favor of result_type.
The most useful part of this function is the first argument, func, which is itself a function, comparable to a function pointer in C/C++.
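As a minimal sketch of how func is passed, the example below uses a made-up DataFrame and a hypothetical helper named scale. Extra positional arguments go through args, and extra keyword arguments are forwarded to the function unchanged.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def scale(column, factor, offset=0):
    # column is a Series; factor and offset are forwarded by apply.
    return column * factor + offset


# Positional extras go in args; keyword extras are passed straight through.
scaled = df.apply(scale, args=(2,), offset=1)

# A lambda works just as well as a named function.
doubled = df.apply(lambda column: column * 2)
```

Here scaled holds each value multiplied by 2 plus 1, and doubled holds each value multiplied by 2.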
You write this function yourself. What it receives depends on axis: with axis=0 (the default), each column is passed to your function as a Series; with axis=1, each row is passed as a Series.
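The effect of axis is easiest to see on a small frame. In this sketch the data and the spread function are invented for illustration; the same function is applied once per column and once per row.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def spread(series):
    # Receives one column (axis=0) or one row (axis=1) as a Series.
    return series.max() - series.min()


per_column = df.apply(spread, axis=0)  # one result per column: a, b
per_row = df.apply(spread, axis=1)     # one result per row: 0, 1, 2
```

With axis=0 the result is indexed by column label, and with axis=1 it is indexed by row label.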
You then pass the function to apply on the DataFrame. The program below shows an example: for every row, it computes the number of days between two date columns.
import datetime

import pandas as pd


def dataInterval(data1, data2):
    # Parse both dates and return data1 - data2 in whole days.
    d1 = datetime.datetime.strptime(data1, '%Y-%m-%d')
    d2 = datetime.datetime.strptime(data2, '%Y-%m-%d')
    delta = d1 - d2
    return delta.days


def getInterval(arrLike):
    # arrLike is one row of the DataFrame, passed as a Series (axis=1).
    PublishedTime = arrLike['PublishedTime']
    ReceivedTime = arrLike['ReceivedTime']
    days = dataInterval(PublishedTime.strip(), ReceivedTime.strip())
    return days


if __name__ == '__main__':
    # Both columns are expected to hold 'YYYY-MM-DD' strings.
    # Reading .xls files requires the xlrd package.
    fileName = "new.xls"
    df = pd.read_excel(fileName)
    df['TimeInterval'] = df.apply(getInterval, axis=1)
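The program above needs the file new.xls, so here is a self-contained variant to experiment with. The two rows are invented stand-ins for the spreadsheet, and get_interval is a hypothetical rewrite of getInterval that uses pd.to_datetime. The last lines sketch a column-wise alternative that gives the same result without apply, which may be faster on large frames.

```python
import pandas as pd

# Hypothetical rows standing in for the contents of new.xls.
df = pd.DataFrame({
    "PublishedTime": ["2019-03-10", " 2019-04-01 "],
    "ReceivedTime": ["2019-03-01", "2019-03-25"],
})


def get_interval(row):
    # row is a Series holding one record; strip stray whitespace first.
    published = pd.to_datetime(row["PublishedTime"].strip(), format="%Y-%m-%d")
    received = pd.to_datetime(row["ReceivedTime"].strip(), format="%Y-%m-%d")
    return (published - received).days


# Row-wise version, as in the program above.
df["TimeInterval"] = df.apply(get_interval, axis=1)

# Column-wise alternative: convert whole columns, then subtract.
published = pd.to_datetime(df["PublishedTime"].str.strip(), format="%Y-%m-%d")
received = pd.to_datetime(df["ReceivedTime"].str.strip(), format="%Y-%m-%d")
df["TimeIntervalVectorized"] = (published - received).dt.days
```

Both columns end up with the same day counts. apply remains the more flexible choice when the per-row logic cannot be expressed as column operations.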