Repeating and splitting a list of dataframes - python

I have a dataset of 150 rows.
I need to split the main dataframe into equal sized overlapping parts. In this case 12, but could be 24 for another data set.
Right now I just repeat this code.. but for a large dataset it takes too much time.
# df1 = df_all_sales.iloc[0:12].. df2 = df_all_sales.iloc[1:13].. and on and on
df1 = pd.DataFrame(df_all_sales, columns=['time', 'sales-transaction']).iloc[0:12]
df2 = pd.DataFrame(df_all_sales, columns=['time', 'sales-transaction']).iloc[1:13]
df3 = pd.DataFrame(df_all_sales, columns=['time', 'sales-transaction']).iloc[2:14]
df4 = pd.DataFrame(df_all_sales, columns=['time', 'sales-transaction']).iloc[3:15]
Is there a good way to simplify this or make the dfs more automatic?
It also needs to be easy to access the different dataframes.
HELP! :)

You can do it by storing the created dataframes in a dictionary; you can change the k value as you want:
k = 12
j = 0
d = {}
# step of 1 so the windows overlap, as in the question
for i in range(0, len(df_all_sales) - k + 1):
    d[j] = pd.DataFrame(df_all_sales, columns=['time', 'sales-transaction']).iloc[i:i+k]
    j = j + 1
You can now access the dataframes as d[0], d[1], ..., up to j, the number of dataframes created.
If you want to access the elements of one of them, you can use its index, for example:
d[0].iloc[0]
If you want elements from a specific column, as an example:
d[0].time # for the whole column
d[0].time.iloc[0] # for that specific element
d[0].time.loc[0:5] # for a range
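For reference, here is a minimal sketch of the same idea as a dict comprehension; the sample frame below is a hypothetical stand-in for df_all_sales:
import numpy as np
import pandas as pd

# hypothetical stand-in for df_all_sales (150 rows, two columns)
df_all_sales = pd.DataFrame({'time': range(150),
                             'sales-transaction': np.random.rand(150)})

k = 12  # window size; could be 24 for another data set
d = {j: df_all_sales[['time', 'sales-transaction']].iloc[i:i+k]
     for j, i in enumerate(range(len(df_all_sales) - k + 1))}

print(len(d))      # 139 overlapping windows
print(d[0].shape)  # (12, 2)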

Related

How to split a dictionary of df in half using pandas?

I have a very large dictionary of dataframes. It contains around 250 dataframes, each of which has around 50 columns. My goal is to concat the dataframes to create one large df; however, as you can imagine, this process isn't great because it will create a df that is way too large to view outside of python.
My goal is to split the large dictionary of df in half and turn it into two large, but manageable files.
I will try to replicate what it looks like:
d = {df1, df2,........,df500}
df = pd.concat(d)
# However, Is there a way to split 50%?
df1 = pd.concat(d) # only gets first 250 of the df
df2 = pd.concat(d) # only gets last 250 df
How about something like this?
v = list(d.values())
part1 = v[:len(v)//2]
part2 = v[len(part1):]
df1 = pd.concat(part1)
df2 = pd.concat(part2)
First of all, as written it's not a dictionary, it's a set, which can be converted to a list.
A list can be divided in two as you need.
d = list(d)
ln = len(d)
d1 = d[0:ln//2]
d2 = d[ln//2:]
df1 = pd.concat(d1)
df2 = pd.concat(d2)
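If the input really is a dict and you want to keep its keys, a small hedged variant: pd.concat also accepts a dict, in which case the keys become an outer index level.
# assuming d is a dict of dataframes
keys = list(d)
half = len(keys) // 2
df1 = pd.concat({k: d[k] for k in keys[:half]})  # keys become an outer index level
df2 = pd.concat({k: d[k] for k in keys[half:]})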

I'm trying to merge a small dataframe to another large one, looping through the small dataframes

I am able to print the small dataframe and see it is being generated correctly; I've written it using the code below. My final result, however, contains just the result of the final merge, as opposed to passing over each one and merging them.
MIK_Quantiles is the first larger dataframe, df2_t is the smaller dataframe being generated in the while loop. The dataframes are both produced correctly and the merge works, but I'm left with just the result of the very last merge. I want it to merge the current df2_t with the already merged result (df_merged) of the previous loop. I hope this makes sense!
i = 0
while i < df_length - 1:
    cur_bound = MIK_Quantiles['bound'].iloc[i]
    cur_percentile = MIK_Quantiles['percentile'].iloc[i]
    cur_bin_low = MIK_Quantiles['auppm'].iloc[i]
    cur_bin_high = MIK_Quantiles['auppm'].iloc[i+1]
    ### Grades/Counts within bin, along with min and max
    df2 = df_orig['auppm'].loc[(df_orig['bound'] == cur_bound) & (df_orig['auppm'] >= cur_bin_low) & (df_orig['auppm'] < cur_bin_high)].describe()
    ### Add fields of interest to the output of describe for later merging together
    df2['bound'] = cur_bound
    df2['percentile'] = cur_percentile
    df2['bin_name'] = 'bin name'
    df2['bin_lower'] = cur_bin_low
    df2['bin_upper'] = cur_bin_high
    df2['temp_merger'] = str(int(df2['bound'])) + '_' + str(df2['percentile'])
    # Write results of describe to a CSV file and transpose columns to rows
    df2.to_csv('df2.csv')
    df2_t = pd.read_csv('df2.csv').T
    df2_t.columns = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max', 'bound', 'percentile', 'bin_name', 'bin_lower', 'bin_upper', 'temp_merger']
    # Merge the results of the describe on the selected data with the table of quantile values to produce a final output
    df_merged = MIK_Quantiles.merge(df2_t, how='inner', on=['temp_merger'])
    pd.merge(df_merged, df2_t)
    print(df_merged)
    i = i + 1
Your loop does not do anything meaningful, other than increment i.
You do a merge of 2 (static) dfs (MIK_Quantiles and df2_t), and you do that df_length number of times. Every time you do that (first, i-th, and last iteration of the loop), you overwrite the output variable df_merged.
To keep in the output whatever has been created in the previous loop iterations, you need to concat all the created df2_t:
df2 = pd.concat([df2, df2_t]) 'appends' the newly created data df2_t to an output dataframe df2 during each iteration of the loop, so in the end all the data will be contained in df2.
Then, after the loop, merge that one onto MIK_Quantiles:
pd.merge(MIK_Quantiles, df2) (not df2_t!) to merge on the accumulated output.
df2 = pd.DataFrame([]) # initialize your output
for i in range(0, df_length):
    df2_t = ... # read your .csv files
    df2 = pd.concat([df2, df2_t])
df2 = ... # do vector operations on df2 (process all of the df2_t at once)
out = pd.merge(MIK_Quantiles, df2)
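A minimal self-contained sketch of that pattern; the names and data here are made up purely to make the accumulate-then-merge idea concrete:
import pandas as pd

# made-up stand-ins for MIK_Quantiles and the per-bin describe() output
MIK_Quantiles = pd.DataFrame({'temp_merger': ['1_50', '1_75'],
                              'percentile': [50, 75]})

df2 = pd.DataFrame([])                      # initialize the output
for tm, n in [('1_50', 10), ('1_75', 20)]:  # stands in for the loop over bins
    df2_t = pd.DataFrame({'temp_merger': [tm], 'count': [n]})
    df2 = pd.concat([df2, df2_t])           # accumulate instead of overwrite

out = pd.merge(MIK_Quantiles, df2)          # single merge after the loop
print(out)  # both rows survive, not just the last one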

Populating a column based off of values in another column

Hi, I am working with pandas to manipulate some lab data. I currently have a data frame with 5 columns.
The first three columns (Analyte, CAS NO(1), and Value(1)) are in the correct order.
The last two columns (CAS NO(2) and Value(2)) are not.
Is there a way to align CAS NO(2) and Value(2) with the first three columns based off of matching CAS numbers (aka CAS NO(2) = CAS NO(1))?
I am new to python and pandas. Thank you for your help.
You can reorder the columns by reassigning the df variable as a slice of itself, indexed on a list whose entries are the column names in question.
colidx = ['Analyte', 'CAS NO(1)', 'CAS NO(2)']
df = df[colidx]
Better to provide input data in text format so we can copy-paste it. I understand your question like this: you need to sort the two last columns together, so that CAS NO(2) matches CAS NO(1).
Since CAS NO(2) = CAS NO(1), you then do not need the duplicated CAS NO(2) column, right?
Split off the two last columns and make a Series from them, then convert that Series to a dict, and use that dict to map new values.
# Split off the 2 last columns and assign an index.
df_tmp = df[['CAS NO(2)', 'Value(2)']]
df_tmp = df_tmp.set_index('CAS NO(2)')
# Keep only the 3 first columns of the original dataframe
df = df[['Analyte', 'CASNo(1)', 'Value(1)']]
# Now copy CASNo(1) to CAS NO(2)
df['CAS NO(2)'] = df['CASNo(1)']
# Now create the Value(2) column on the original dataframe
df['Value(2)'] = df['CASNo(1)'].map(df_tmp.to_dict()['Value(2)'])
Try the following:
import pandas as pd
import numpy as np
#create an example of your table
list_CASNo1 = ['71-43-2', '100-41-4', np.nan, '1634-04-4']
list_Val1 = [np.nan]*len(list_CASNo1)
list_CASNo2 = [np.nan, np.nan, np.nan, '100-41-4']
list_Val2 = [np.nan, np.nan, np.nan, '18']
df = pd.DataFrame(zip(list_CASNo1, list_Val1, list_CASNo2, list_Val2), columns =['CASNo(1)','Value(1)','CAS NO(2)','Value(2)'], index = ['Benzene','Ethylbenzene','Gasonline Range Organics','Methyl-tert-butyl ether'])
#split the data to two dataframes
df1 = df[['CASNo(1)','Value(1)']]
df2 = df[['CAS NO(2)','Value(2)']]
#merge df2 to df1 based on the specified columns
#reset_index and set_index will take care
#that df_adjusted will have the same index names as df1
df_adjusted = df1.reset_index().merge(df2.dropna(),
                                      how='left',
                                      left_on='CASNo(1)',
                                      right_on='CAS NO(2)').set_index('index')
But be careful with duplicates in your key columns; those will make the merge produce duplicated rows rather than fail outright.
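One hedged way to guard against that, reusing the df1/df2 names from the snippet above, is to drop duplicate CAS numbers before merging:
# keep only the first occurrence of each CAS number before merging
df2_unique = df2.dropna().drop_duplicates(subset='CAS NO(2)')
df_adjusted = df1.reset_index().merge(df2_unique,
                                      how='left',
                                      left_on='CASNo(1)',
                                      right_on='CAS NO(2)').set_index('index')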

A fast method to add a label column to large pd dataframe based on a range of another column

I'm fairly new to python and am working with large dataframes with upwards of 40 million rows. I would like to be able to add another 'label' column based on the value of another column.
Say I have a pandas dataframe (much smaller here for detailing the problem):
import pandas as pd
import numpy as np
#using random to randomly get vals (as my data is not sorted)
my_df = pd.DataFrame(np.random.randint(0,100,1000),columns = ['col1'])
I then have another dictionary containing ranges associated with a specific label, similar to something like:
my_label_dict ={}
my_label_dict['label1'] = np.array([[0,10],[30,40],[50,55]])
my_label_dict['label2'] = np.array([[11,15],[45,50]])
Any data in my_df should be 'label1' if it is between 0 and 10, 30 and 40, or 50 and 55,
and any data should be 'label2' if it is between 11 and 15 or 45 and 50.
I have only managed to isolate data based on the labels and retrieve an index through something like:
idx_save = np.full(len(my_df), False, dtype=bool)
for rng in my_label_dict['label1']:
    idx_temp = np.logical_and(my_df['col1'] > rng[0], my_df['col1'] < rng[1])
    idx_save = idx_save | idx_temp
and then use this index to access the label1 values from my_df, and then repeat for label2.
Ideally I would like to add another column on my_df named 'labels' which would hold 'label1' for all datapoints that satisfy the given ranges, etc. Or just a quick method to retrieve all values from the dataframe that satisfy the ranges in the labels.
I'm new to generator functions, and havent completely gotten my head around them but maybe they could be used here?
Thanks for any help!!
You can do the task in a more "pandasonic" way.
Start from creating a Series, named labels, initially with empty strings:
labels = pd.Series([''] * 100).rename('label')
The length is 100, just as the upper limit of your values.
Then fill it with proper labels:
for key, val in my_label_dict.items():
    for v in val:
        labels[v[0]:v[1]+1] = key
And the only thing to do is to merge your DataFrame with labels:
my_df = my_df.merge(labels, how='left', left_on='col1', right_index=True)
I also noticed such a contradiction in my_label_dict:
you have label1 for the range between 50 and 55 (I assume inclusive),
you also have label2 for the range between 45 and 50,
so for the value 50 you have two definitions.
My program acts on the "last decision takes precedence" principle, so the label
for 50 is label2. Maybe you should change one of these range borders?
Edit
A modified solution if the upper limit of col1 is "unpredictable":
Define labels the following way:
import itertools

rngMax = max(np.array(list(itertools.chain.from_iterable(
    my_label_dict.values())))[:, 1])
labels = pd.Series([np.nan] * (rngMax + 1)).rename('label')
for key, val in my_label_dict.items():
    for v in val:
        labels[v[0]:v[1]+1] = key
labels.dropna(inplace=True)
Add .fillna('') to my_df.merge(...).
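Assembling the pieces above into one runnable snippet on the question's setup (a hedged sketch; the data is random, so counts will vary):
import itertools
import numpy as np
import pandas as pd

my_df = pd.DataFrame(np.random.randint(0, 100, 1000), columns=['col1'])
my_label_dict = {'label1': np.array([[0, 10], [30, 40], [50, 55]]),
                 'label2': np.array([[11, 15], [45, 50]])}

rngMax = max(np.array(list(itertools.chain.from_iterable(
    my_label_dict.values())))[:, 1])
labels = pd.Series([np.nan] * (rngMax + 1)).rename('label')
for key, val in my_label_dict.items():
    for v in val:
        labels[v[0]:v[1]+1] = key
labels.dropna(inplace=True)

my_df = my_df.merge(labels, how='left', left_on='col1', right_index=True).fillna('')
print(my_df['label'].value_counts())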
Here is a solution that would also work for float ranges, where you can't create a mapping for all possible values. This solution requires resorting your dataframes.
# build a dataframe you can join and sort it for the from-field
join_df=pd.DataFrame({
'from': [ 0, 30, 50, 11, 45],
'to': [10, 40, 55, 15, 50],
'label': ['label1', 'label1', 'label1', 'label2', 'label2']
})
join_df.sort_values('from', axis='index', inplace=True)
# calculate the maximum range length (but you could alternatively set it to any value larger than your largest range as well)
max_tolerance=(join_df['to'] - join_df['from']).max()
# sort your value dataframe for the column to join on and do the join
my_df.sort_values('col1', axis='index', inplace=True)
result= pd.merge_asof(my_df, join_df, left_on='col1', right_on='from', direction='backward', tolerance=max_tolerance)
# now you just have to remove the labels for the rows where the value passed the end of the range, and drop the two range columns
result.loc[result['to'] < result['col1'], 'label'] = np.nan
result.drop(['from', 'to'], axis='columns', inplace=True)
The merge_asof(..., direction='backward', ...) joins, for each row in my_df, the row in join_df with the maximum value in from that still satisfies from <= col1. It doesn't look at the to column at all. This is why we remove the labels where the to boundary is violated, by assigning np.nan in the line with the .loc.
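A quick hedged sanity check on the question's ranges, using the result frame from the snippet above:
# values inside a label1 range should map to 'label1', values in a gap to NaN
print(result.loc[result['col1'].between(30, 40), 'label'].unique())  # ['label1']
print(result.loc[result['col1'].between(16, 29), 'label'].unique())  # [nan]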

My dataframe has many (192) columns. How to select two columns at a time?

My dataframe is like df.columns = ['Time1','Pmpp1','Time2',..........,'Pmpp96']. I want to select two successive columns at a time, for example Time1 and Pmpp1.
My code is:
for i, j in zip(df.columns, df.columns[1:]):
    print(i, j)
My present output is:
Time1 Pmpp1
Pmpp1 Time2
Time2 Pmpp2
Expected output is:
Time1 Pmpp1
Time2 Pmpp2
Time3 Pmpp3
You're zipping the list with the same list starting from the second element, which is not what you want. You want to zip the even-indexed columns with the odd-indexed ones. For example, you could replace your code with:
for i, j in zip(df.columns[::2], df.columns[1::2]):
    print(i, j)
As an alternative to integer positional slicing, you can use str.startswith to create 2 index objects. Then use zip to iterate over them pairwise:
df = pd.DataFrame(columns=['Time1', 'Pmpp1', 'Time2', 'Pmpp2', 'Time3', 'Pmpp3'])
times = df.columns[df.columns.str.startswith('Time')]
pmpps = df.columns[df.columns.str.startswith('Pmpp')]
for i, j in zip(times, pmpps):
    print(i, j)
Time1 Pmpp1
Time2 Pmpp2
Time3 Pmpp3
In this kind of scenario, it might make sense to reshape your DataFrame. So instead of selecting two columns at a time, you have a DataFrame with the two columns that ultimately represent your measurements.
First, you make a list of DataFrames, where each one only has a Time and Pmpp column:
dfs = []
for i in range(1, 97):
    # .copy() avoids a SettingWithCopyWarning when adding columns below
    tmp = df[['Time{0}'.format(i), 'Pmpp{0}'.format(i)]].copy()
    tmp.columns = ['Time', 'Pmpp'] # Standardize column names
    tmp['n'] = i # Remember measurement number
    dfs.append(tmp) # Keep with our cleaned dataframes
And then you can concatenate them together into a new DataFrame that has three columns.
new_df = pd.concat(dfs, ignore_index=True, sort=False)
This should be a much more manageable shape for your data.
>>> new_df.columns
Index(['Time', 'Pmpp', 'n'], dtype='object')
Now you can iterate through the rows in this DataFrame and get the values for your expected output:
for i, row in new_df.iterrows():
    print(i, row.n, row.Time, row.Pmpp)
It also will make it easier to use the rest of pandas to analyze your data.
new_df.Pmpp.mean()
new_df.describe()
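For instance, a per-measurement summary becomes a one-liner in this long format (a small hedged illustration):
# average Pmpp for each measurement number n
new_df.groupby('n')['Pmpp'].mean()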
After a series of trials, I got it. My code is given below:
for a in range(0, len(df.columns), 2):
    print(df.columns[a], df.columns[a+1])
My output is:
DateTime A016.Pmp_ref
DateTime.1 A024.Pmp_ref
DateTime.2 A040.Pmp_ref
DateTime.3 A048.Pmp_ref
DateTime.4 A056.Pmp_ref
DateTime.5 A064.Pmp_ref
DateTime.6 A072.Pmp_ref
DateTime.7 A080.Pmp_ref
DateTime.8 A096.Pmp_ref
DateTime.9 A120.Pmp_ref
DateTime.10 A124.Pmp_ref
DateTime.11 A128.Pmp_ref
