Pandas apply

Pandas is a very useful library for data processing in Python, and it provides many data manipulation methods. Many algorithm libraries also expect pandas data structures as their input.

You can parse many kinds of data, including CSV, MS Excel, JSON, HTML and more. The functions that read and write these formats are some of the most useful in pandas.

Supported formats

You can load many formats. The overview below lists each one, together with the function that reads it into a dataframe and the function that writes it back out.

Format type   Data description    Reader           Writer
text          CSV                 read_csv         to_csv
text          JSON                read_json        to_json
text          HTML                read_html        to_html
text          Local clipboard     read_clipboard   to_clipboard
binary        MS Excel            read_excel       to_excel
binary        HDF5 Format         read_hdf         to_hdf
binary        Feather Format      read_feather     to_feather
binary        Parquet Format      read_parquet     to_parquet
binary        Msgpack             read_msgpack     to_msgpack (both removed in pandas 1.0)
binary        Stata               read_stata       to_stata
binary        SAS                 read_sas         (none)
binary        Python Pickle       read_pickle      to_pickle
SQL           SQL                 read_sql         to_sql
SQL           Google Big Query    read_gbq         to_gbq

Apply function

Once the data is loaded, pandas offers many useful functions for processing it, but I think the best one to use is the apply function.

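Its signature in pandas 1.0 and later is as follows (the older broadcast and reduce parameters were removed in pandas 1.0 in favor of result_type, and newer releases add a few more keyword arguments).

DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)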

The most useful part of this function is the first argument, func. It is itself a function (any callable) that you pass in, much like a function pointer in C/C++.

You have to implement this function yourself. What it receives depends on axis: with axis=1, each row of the dataframe is passed to your function as a Series; with the default axis=0, each column is passed instead.
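
To make the effect of axis concrete, here is a small sketch. The dataframe and the function names total and scaled_total are invented for this illustration; the last call also shows how args forwards extra positional arguments to your function.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def total(values):
    # `values` is a Series: one column when axis=0, one row when axis=1.
    return values.sum()


def scaled_total(row, factor):
    # `factor` is supplied through the args parameter of apply.
    return row.sum() * factor


column_totals = df.apply(total, axis=0)  # one result per column
row_totals = df.apply(total, axis=1)     # one result per row
scaled = df.apply(scaled_total, axis=1, args=(2,))
```

Here column_totals is indexed by the column names, while row_totals and scaled are indexed by the row labels.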

You apply the function to the data frame. The program below shows an example of its usage:

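import datetime

import pandas as pd


def dataInterval(data1, data2):
    # Parse two 'YYYY-MM-DD' strings and return the difference in whole days.
    d1 = datetime.datetime.strptime(data1, '%Y-%m-%d')
    d2 = datetime.datetime.strptime(data2, '%Y-%m-%d')
    delta = d1 - d2
    return delta.days


def getInterval(arrLike):
    # arrLike is one row of the dataframe, passed as a Series because axis=1.
    PublishedTime = arrLike['PublishedTime']
    ReceivedTime = arrLike['ReceivedTime']
    # Assumes both columns hold date strings; strip stray whitespace before parsing.
    days = dataInterval(PublishedTime.strip(), ReceivedTime.strip())
    return days


if __name__ == '__main__':
    fileName = "new.xls"
    df = pd.read_excel(fileName)
    # Compute the interval for every row and store it in a new column.
    df['TimeInterval'] = df.apply(getInterval, axis=1)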
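
The program above needs the file new.xls, so here is a self-contained sketch of the same idea that you can run directly. The sample rows, the function name get_interval and the TimeIntervalVectorized column are assumptions made for this illustration; only the PublishedTime and ReceivedTime column names come from the example. It also shows a vectorized alternative with pd.to_datetime, which is a different technique from apply and usually faster on large dataframes.

```python
import pandas as pd

# Hypothetical rows standing in for the contents of new.xls.
df = pd.DataFrame({
    "PublishedTime": ["2020-01-10", " 2020-03-01 "],
    "ReceivedTime": ["2020-01-01", "2020-02-20"],
})


def get_interval(row):
    # `row` is a Series holding one row, because apply is called with axis=1.
    published = pd.to_datetime(row["PublishedTime"].strip(), format="%Y-%m-%d")
    received = pd.to_datetime(row["ReceivedTime"].strip(), format="%Y-%m-%d")
    return (published - received).days


# Row-wise version using apply.
df["TimeInterval"] = df.apply(get_interval, axis=1)

# Vectorized alternative: convert whole columns at once, then subtract.
published = pd.to_datetime(df["PublishedTime"].str.strip(), format="%Y-%m-%d")
received = pd.to_datetime(df["ReceivedTime"].str.strip(), format="%Y-%m-%d")
df["TimeIntervalVectorized"] = (published - received).dt.days
```

Both columns end up with the same values. apply is the more flexible choice when the per-row logic cannot be expressed with column operations.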