I have a question about improving the performance of the following code:
df["range_column"] = list(zip(df.START, df1.END))
df["range_col"] = df["range_col"].swifter.apply(lambda x: pd.date_range(x[0], x[1], freq="60min"))
Explanation: I have two datetime columns. From these I build a tuple per row and then a date range with a 60-minute frequency.
For larger datasets, this code takes quite a long time to run.
Below I have created some sample data to run the code.
Does anyone perhaps know of an alternative that produces the same result but is faster?
import pandas as pd
from faker import Faker
import swifter
# create some fake date data
fake = Faker()
Faker.seed(0)
starts = [fake.date("%Y-%m-%d_%H_%M_%S") for _ in range(5)]
ends = [fake.date("%Y-%m-%d_%H_%M_%S") for _ in range(5)]
# create the two dataframes
df = pd.DataFrame({"START": pd.to_datetime(starts, format="%Y-%m-%d_%H_%M_%S")})
df1 = pd.DataFrame({"END": pd.to_datetime(ends, format="%Y-%m-%d_%H_%M_%S")})
# combine the two frames side by side
df2 = pd.concat([df, df1], axis=1)
# create tuple
df2["range_col"] = list(zip(df2.START, df2.END))
# create date range
df2["range__col1"] = df2["range_col"].swifter.apply(lambda x: pd.date_range(x[0], x[1], freq="60min"))
Please help me solve this: how do I add a new column to df with the duration result, computed for every row? Thanks.
import pandas as pd
from datetime import time, datetime
from itertools import repeat
from business_duration import businessDuration  # assuming the business-duration package
df = pd.read_csv("data.csv")
df['startdate_column'] = pd.to_datetime(df['startdate_column'])
df['enddate_column'] = pd.to_datetime(df['enddate_column'])
start_time=time(8,0,0)
end_time=time(17,0,0)
unit='min'
df['Duration'] = list(map(businessDuration,startdate=df['startdate_column'],enddate=df['enddate_column'],repeat(start_time),repeat(end_time),repeat(weekendlist=[6]),repeat(unit)))
Use apply with axis=1 instead; each row is passed to the function as a Series, so every argument can be supplied as a keyword:
f = lambda x: businessDuration(startdate=x['startdate_column'],
                               enddate=x['enddate_column'],
                               starttime=start_time,
                               endtime=end_time,
                               weekendlist=[6],
                               unit=unit)
df['Duration'] = df.apply(f, axis=1)
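If you want to sanity-check this without a data.csv, here is a minimal self-contained sketch; it assumes the business_duration package exposes businessDuration with the keyword arguments used above, and the dates are made up:
import pandas as pd
from datetime import time
from business_duration import businessDuration  # assumed package/function

# two made-up date pairs
df = pd.DataFrame({
    'startdate_column': pd.to_datetime(['2021-03-01 09:00', '2021-03-02 10:30']),
    'enddate_column': pd.to_datetime(['2021-03-01 16:00', '2021-03-03 12:00']),
})
f = lambda x: businessDuration(startdate=x['startdate_column'],
                               enddate=x['enddate_column'],
                               starttime=time(8, 0, 0),
                               endtime=time(17, 0, 0),
                               weekendlist=[6],
                               unit='min')
df['Duration'] = df.apply(f, axis=1)
print(df)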
I have a list of filepaths in the first column of a dataframe. My goal is to create a second column that represents file categories, with categories reflecting the words in the filepath.
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']}
df = pd.DataFrame(data)
df["Animal"] =(df['filepath'].str.contains("dog|cat",case=False,regex=True))
df["Fish"] =(df['filepath'].str.contains("barracuda",case=False))
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
This code works. The problem arises when I have 200 statements beginning with df['columnName'] =. Because there are so many, I get this warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
To fix this I have tried:
dfAnimal = df.copy
dfAnimal['Animal'] = dfAnimal['filepath'].str.contains("dog|cat",case=False,regex=True)
dfFish = df.copy
dfFish["Fish"] =dfFish['filepath'].str.contains("barracuda",case=False)
df = pd.concat(dfAnimal,dfFish)
The above gives me errors such as 'method' object is not iterable and 'method' object is not subscriptable. I then tried df = df.loc[df['filepath'].isin(['cat','dog'])], but this only works when 'cat' or 'dog' is the whole value in the column. How do I avoid the performance warning?
Try creating all your new columns in a dict, converting that dict into a DataFrame, and then adding the resulting DataFrame to the original one with a single pd.concat. That way the frame is extended once, rather than once per column; the warning is triggered by the repeated frame.insert calls that each df['columnName'] = ... assignment performs:
new_columns = {
    'Animal': df['filepath'].str.contains("dog|cat", case=False, regex=True),
    'Fish': df['filepath'].str.contains("barracuda", case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
Added to your original code, it would be something like this:
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']}
df = pd.DataFrame(data)
##### These are the new lines #####
new_columns = {
    'Animal': df['filepath'].str.contains("dog|cat", case=False, regex=True),
    'Fish': df['filepath'].str.contains("barracuda", case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
##### End of new lines #####
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
I have the following df:
rows = [
    {'Place': 'Winterthur', 'PLZ': [8400, 8401, 8402, 8404, 8405, 8406, 8407, 8408, 8409, 8410, 8411], 'shortName': 'WIN', 'Parzellen': []},
    {'Place': 'Opfikon', 'PLZ': [8152], 'shortName': 'OPF', 'Parzellen': []},
    {'Place': 'Stadel', 'PLZ': [8174], 'shortName': 'STA', 'Parzellen': []},
    {'Place': 'Kloten', 'PLZ': [8302], 'shortName': 'KLO', 'Parzellen': []},
    {'Place': 'Niederhasli', 'PLZ': [8155, 8156], 'shortName': 'NIH', 'Parzellen': []},
    {'Place': 'Bassersdorf', 'PLZ': [8303], 'shortName': 'BAS', 'Parzellen': []},
    {'Place': 'Oberglatt', 'PLZ': [8154], 'shortName': 'OBE', 'Parzellen': []},
    {'Place': 'Bülach', 'PLZ': [8180], 'shortName': 'BUE', 'Parzellen': []},
]
# DataFrame.append was removed in pandas 2.0; build the frame in one call instead
df = pd.DataFrame(rows, columns=['Place', 'PLZ', 'shortName', 'Parzellen'])
print(df)
Now, given a number like 8405, I want to find the Place (or the whole row) whose df['PLZ'] list contains that number.
I also tried a class-based approach, but it was hard to collect the numbers of all objects: I want to be able to get all PLZ values as one list and, given any number, check which Place it belongs to. Maybe there is an obviously better way that I just don't know.
Try boolean masking with the map() method:
df[df['PLZ'].map(lambda x: 8405 in x)]
or boolean masking with the agg() method:
df[df['PLZ'].agg(lambda x: 8405 in x)]
# you can also use apply() in place of agg()
Output of the above code:
Place PLZ shortName Parzellen
0 Winterthur [8400, 8401, 8402, 8404, 8405, 8406, 8407, 840... WIN []
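If you need to do many such lookups, a possible alternative (not from the original answer) is to explode the PLZ lists into one row per number and index by it; a small sketch:
# one row per PLZ value, then ordinary index lookups apply
lookup = df.explode('PLZ').set_index('PLZ')
print(lookup.loc[8405, 'Place'])  # -> Winterthur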
I was wondering whether somebody could give me some assistance with the pandas iterrows method.
I'm currently using an iterative function that works, but I was wondering whether iterrows would be more efficient than the explicit for loop?
import pandas as pd
import numpy as np
dataframe_1 = pd.read_csv(r"D:\data\2018_19.csv")  # raw string so the backslashes are not treated as escapes
def append_date_column(df):
    df = df.copy()
    df['date_column'] = np.nan
    # month-end dates; "income2" holds a 1-based index into this range
    date_range = pd.date_range(start='01/01/2001', periods=207, freq='M').values
    for row in range(df.shape[0]):
        date_number = df.loc[row, "income2"]
        if (not pd.isna(date_number)) and date_number < 207:
            df.loc[row, 'date_column'] = date_range[int(date_number) - 1]
    return df
Thanks!
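For what it's worth, iterrows is usually slower than a positional loop, not faster. If the goal is speed, a vectorized lookup is the more idiomatic route; a possible sketch, assuming income2 holds 1-based month indices as in the function above:
def append_date_column_vectorized(df):
    df = df.copy()
    date_range = pd.date_range(start='01/01/2001', periods=207, freq='M')
    # rows with a usable index: not NaN and inside the lookup range
    mask = df['income2'].notna() & (df['income2'] < 207)
    df['date_column'] = pd.NaT
    df.loc[mask, 'date_column'] = date_range[df.loc[mask, 'income2'].astype(int).to_numpy() - 1]
    return df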
I am trying to combine a series of stock tick DataFrames based on the dates, but it won't work. Please help.
import pandas as pd
import tushare as ts

def get_all_tick(stockID):
    dates = pd.date_range('2016-01-01', periods=5, freq='D')
    append_data = []
    for i in dates:
        stock_tick = pd.DataFrame(ts.get_tick_data(stockID, date=i))
        stock_tick.sort_values('volume', inplace=True, ascending=False)
        stock_tick = stock_tick[:10]
        stock_tick.sort_values('time', inplace=True, ascending=False)
        append_data.append(stock_tick.iterrows())
get_all_tick('300243')
I figured it out myself.
def get_all_tick(stockID):
    .........
    frames = []
    for i in get_date:
        stock_tick = ts.get_tick_data(stockID, date=i)
        stock_tick['Date'] = i
        stock_tick.sort_values('volume', inplace=True, ascending=False)
        stock_tick = stock_tick[:10]
        stock_tick.sort_values('time', inplace=True, ascending=False)
        frames.append(stock_tick)
    # concatenate once at the end: DataFrame.append was removed in pandas 2.0,
    # and growing a frame inside the loop copies it on every iteration
    df = pd.concat(frames)
    df.to_excel('tick.xlsx', sheet_name='Sheet1')
get_all_tick('300243')