How to split a dictionary of df in half using pandas? - python

I have a very large dictionary of dataframes. It contains around 250 dataframes, each with around 50 columns. My goal is to concat the dataframes to create one large df; however, as you can imagine, this isn't great because the resulting df would be far too large to view outside of python.
Instead, I'd like to split the large dictionary of dataframes in half and turn it into two large but manageable dataframes.
I will try to replicate what it looks like:
d = {df1, df2,........,df500}
df = pd.concat(d)
# However, is there a way to split it 50/50?
df1 = pd.concat(d)  # only gets the first 250 dfs
df2 = pd.concat(d)  # only gets the last 250 dfs

How about something like this?
v = list(d.values())
part1 = v[:len(v)//2]
part2 = v[len(part1):]
df1 = pd.concat(part1)
df2 = pd.concat(part2)

First of all, that's not a dictionary; it's a set, which can be converted to a list.
A list can then be divided in two as you need.
d = list(d)
ln = len(d)
d1 = d[0:ln//2]
d2 = d[ln//2:]
df1 = pd.concat(d1)
df2 = pd.concat(d2)
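If you ever need more than two pieces, a small helper generalizes the same idea (a sketch; concat_in_chunks is a hypothetical name, and note that iteration order over a set is arbitrary):
import pandas as pd

def concat_in_chunks(dfs, n_parts=2):
    # split a list of DataFrames into n_parts roughly equal groups,
    # then concatenate each group into one DataFrame
    size = -(-len(dfs) // n_parts)  # ceiling division
    return [pd.concat(dfs[i:i + size]) for i in range(0, len(dfs), size)]

df1, df2 = concat_in_chunks(list(d), n_parts=2)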

Related

repeating and splitting list of dataframes

I have a dataset of 150 rows.
I need to split the main dataframe into equal sized overlapping parts. In this case 12, but could be 24 for another data set.
Right now I just repeat this code, but for a large dataset it takes too much time.
# df1 = df_all_sales.iloc[0:12].. df2 = df_all_sales.iloc[1:13].. and on and on
df1 = pd.DataFrame(df_all_sales, columns=['time', 'sales-transaction']).iloc[0:12]
df2 = pd.DataFrame(df_all_sales, columns=['time', 'sales-transaction']).iloc[1:13]
df3 = pd.DataFrame(df_all_sales, columns=['time', 'sales-transaction']).iloc[2:14]
df4 = pd.DataFrame(df_all_sales, columns=['time', 'sales-transaction']).iloc[3:15]
Is there a good way to simplify this or make the dfs more automatic?
It needs to be easy to access the different dataframes, too.
HELP! :)
You can do it by storing the created dataframes in a dictionary; change the value of k as you need:
k = 12  # window size
j = 0
d = {}
# a step of 1 gives the overlapping windows shown in the question
for i in range(0, len(df_all_sales) - k + 1):
    d[j] = pd.DataFrame(df_all_sales, columns=['time', 'sales-transaction']).iloc[i:i+k]
    j = j + 1
You can now access them as d[0], d[1], ..., up to j, the number of dataframes created.
If you want to access individual elements, use the index, for example:
d[0].iloc[0]
If you want elements from a specific column, for example:
d[0].time # the whole column
d[0].time.iloc[0] # a specific element
d[0].time.loc[0:5] # a range
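For what it's worth, the same windows can be built with a single dict comprehension (a sketch, assuming df_all_sales and the window size k from above):
cols = ['time', 'sales-transaction']
d = {j: df_all_sales[cols].iloc[i:i + k]
     for j, i in enumerate(range(0, len(df_all_sales) - k + 1))}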

Concatenate two pandas dataframes and follow a sequence of uid

I have two pandas dataframes with the following data (in csv):
#list1
poke_id,symbol
0,BTC
1,ETB
2,USDC
#list2
5,SOL
6,XRP
I am able to concatenate them into one dataframe using the following code:
df = pd.concat([df1, df2], ignore_index = True)
df = df.reset_index(drop = True)
df['poke_id'] = df.index
df = df[['poke_id','symbol']]
which gives me the output (in csv):
poke_id,symbol
0,BTC
1,ETB
2,USDC
3,SOL
4,XRP
Is there any other way to do the same? I think processing the whole dataframe of ~4000 entries just to add ~100 more is a little pointless and cumbersome. How can I make it pick the highest poke_id from list 1 (or dataframe 1) and just continue with i + 1 for the later entries in list 2?
Your solution is good; it's possible to simplify it:
df = pd.concat([df1, df2], ignore_index = True).rename_axis('poke_id').reset_index()
You can also use indexes to slice the data you want from the dataframe; this is not efficient if you want large amounts of data, but it lets you take specific slices from the dataframe.
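If you'd rather not re-index the existing ~4000 rows at all, here is a sketch of the incremental approach the question describes: take the highest poke_id from df1 and continue numbering from there:
start = df1['poke_id'].max() + 1
df2 = df2.copy()
df2['poke_id'] = range(start, start + len(df2))
df = pd.concat([df1, df2], ignore_index=True)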

Pandas - contains from other DF

I have 2 dataframes, DFA and DFB.
I need to check every row of DFA['item'] for whether it contains any of the values in DFB['original'], and if it does, add a new column DFA['my'] that holds the corresponding value from DFB['my'].
I thought of converting DFB['original'] into a list and then using regex, but that way I won't get the matching result from column 'my'.
Ok, maybe not the best solution, but it seems to be working.
I did a cartesian join and then checked for the records that contain the needed data:
dfa['join'] = 1
dfb['join'] = 1
dfFull = dfa.merge(dfb, on='join').drop('join', axis=1)  # cartesian product
dfFull['match'] = dfFull.apply(lambda x: x.original in x.item, axis=1)
dfFull[dfFull['match']]
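As a side note, pandas 1.2+ has a built-in cross join, so the dummy 'join' column isn't needed (an equivalent sketch):
dfFull = dfa.merge(dfb, how='cross')  # cartesian product without a dummy key
dfFull['match'] = dfFull.apply(lambda x: x.original in x.item, axis=1)
dfFull[dfFull['match']]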

Finding mean of consecutive column data

I have the following data (the data given here is just representational):
I want to do the following with this data:
I want to keep only the columns from the 201 group onward,
i.e. I want to remove the 200-1 to 200-4 column data.
One way to do this would be to read only the required columns from excel, but I want to know how to filter column names by a pattern, since the 200-1 to 200-4 columns all share the pattern 200-*.
I want to make columns after 202-4 that store values as follows:
201q1 = mean of (201-1 and 201-2)
201q2 = mean of (201-3 and 201-4)
Similarly, if 202-1 to 202-4 data is present, similar columns should be formed.
Please help.
Thanks in advance for your support.
This is a rough example but it will get you close. The example assumes that there are always four columns per group:
import numpy as np
import pandas as pd

# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.randn(2,12), columns=['200-1','200-2','200-3','200-4', '201-1', '201-2', '201-3','201-4', '202-1', '202-2', '202-3','202-4'])
# remove 200-* columns
df2 = df[df.columns[~df.columns.str.contains('200-')]]
# use np.arange to create groups of two adjacent columns
new = df2.groupby(np.arange(len(df2.columns))//2, axis=1).mean()
# rename columns
new.columns = [f'{v}{k}' for v,k in zip([x[:3] for x in df2.columns[::2]], ['q1','q2']*int(len(df2.columns[::2])/2))]
# join
df2.join(new)
201-1 201-2 201-3 201-4 202-1 202-2 202-3 \
0 0.865408 -2.301539 1.744812 -0.761207 0.319039 -0.249370 1.462108
1 -0.172428 -0.877858 0.042214 0.582815 -1.100619 1.144724 0.901591
202-4 201q1 201q2 202q1 202q2
0 -2.060141 -0.718066 0.491802 0.034834 -0.299016
1 0.502494 -0.525143 0.312514 0.022052 0.702043
For step 1, you can get away with a list comprehension and the pandas drop function:
dropcols = [x for x in df.columns if '200-' in x]
df.drop(dropcols, axis=1, inplace=True)
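Equivalently, you can drop the pattern-matched columns in one line with boolean column selection (a sketch):
df = df.loc[:, ~df.columns.str.startswith('200-')]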
Steps 3 and 4 are similar; you can calculate the rolling mean of the columns:
df2 = df.rolling(2, axis = 1).mean() # creates rolling mean
df2.columns = [x.replace('-', 'q') for x in df2.columns] # renames the columns
dfans = pd.concat([df, df2], axis = 1) # concatenate the columns together
Now, you just need to remove the columns that you don't want and rename them.

How to take mean, std and mad of multiple dataframes that are appended in a list?

I have several hundred dataframes appended in a list. All the dataframes have the same number of columns, but the number of rows varies. The column names are also the same.
I want to take the mean, mad, and std of each column across the dataframes, and I'm doing something like this:
All the dataframes are appended in a list (lst):
from functools import reduce

lst = []
for filen, filen1 in zip(filelistn, filelist1):
    df1 = pd.read_table(path_to_files+filen, skiprows=0, usecols=(0,1,2,3,4,8), names=['wave','num','stlines','fwhm','EWs','MeasredWave'], delimiter=r'\s+')
    df2 = pd.read_table(path_to_files1+filen1, skiprows=0, usecols=(0,1,2,3,4,8), names=['wave','num','stlines','fwhm','EWs','MeasredWave'], delimiter=r'\s+')
    dfs = pd.merge(df1, df2, on='wave', how='inner')
    dfs = df1 - df2  # note: this immediately overwrites the merged frame
    lst.append(dfs)

df = reduce(lambda x, y: pd.merge(x, y, on='wave', how='outer'), lst)
df = df.rename(columns=lambda x: x.split('_')[0]).T
df = df.groupby(df.index).agg(['mean','std','mad','median']).T
But the results I'm getting are a bit weird; in the mad column there are values like 21, 65, 36, which is absurd.
wave mean median mad
0 4050.32 -0.016182 -0.011940 0.008885
1 4208.98 0.023707 0.007189 0.032585
2 4374.94 -0.001321 -0.001196 0.000378
3 4379.74 0.002778 0.003380 0.004685
4 6828.60 -10.604568 -0.000590 21.084799
5 6839.84 -0.003466 -0.001870 0.010169
6 6842.04 -32.751551 -0.002514 65.118329
7 6842.69 18.293519 -0.002158 36.385884
The column wave is the same in all the dataframes, but the number of rows is not. Does it have anything to do with that? Maybe it's taking the mean of the wrong rows?
Can anyone tell me how to solve this?
You can use pandas.concat to concatenate the sequence of dataframes into one large dataframe and calculate the statistics afterwards, like so:
import pandas as pd
# lst = [construct list of dataframes ...]
df = pd.concat(lst, axis=0)
means = df.mean()
stds = df.std()
Edit: if you would like to get the statistics broken down by some key, e.g. wave, you can use the following.
means = df.groupby('wave').mean()
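If you want all four statistics per wave in one go, here is a sketch (note: DataFrame.mad was removed in pandas 2.0, so the mean absolute deviation is computed by hand here):
stats = df.groupby('wave').agg(['mean', 'std', 'median'])
# mean absolute deviation per column, per wave
mad = df.groupby('wave').agg(lambda s: (s - s.mean()).abs().mean())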
