I am trying to combine hundreds of CSVs in Python using the following code:
import os
import pandas as pd
import glob
path = '/Users/parkerbolstad/Downloads/'
all_files = glob.glob(os.path.join(path, "*.csv"))
df_from_each_file = (pd.read_csv(f, sep=',') for f in all_files)
df_merged = pd.concat(df_from_each_file, axis=1, ignore_index=False)
df_merged.to_csv("merged.csv")
But this combines all the files in their entirety. The first column of each file holds dates; I want to keep the dates from the first file and skip them for the rest.
As it stands, I end up with a new date column every four columns.
Simply run a for-loop to drop this column from every DataFrame except the first, sliced with [1:]. (Note that df_from_each_file must be a list for slicing, so build it with a list comprehension rather than the generator expression above.)
for df in df_from_each_file[1:]:
    df.drop('date', axis=1, inplace=True)
import pandas as pd

df1 = pd.DataFrame({
    'date': ['2021.08.01', '2021.08.02', '2021.08.03'],
    'value': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
    'date': ['2021.08.01', '2021.08.02', '2021.08.03'],
    'value': ['X', 'Y', 'Z']
})
df3 = pd.DataFrame({
    'date': ['2021.08.01', '2021.08.02', '2021.08.03'],
    'value': ['1', '2', '3']
})

df_from_each_file = [df1, df2, df3]

# drop the date column from every frame except the first
for df in df_from_each_file[1:]:
    df.drop('date', axis=1, inplace=True)

result = pd.concat(df_from_each_file, axis=1)
print(result)
Result:
date value value value
0 2021.08.01 A X 1
1 2021.08.02 B Y 2
2 2021.08.03 C Z 3
Alternatively, in every DataFrame convert the date column into the index, concatenate, and reset the index afterwards.
This correctly aligns rows even when the dates sit in different row positions or some dates are missing.
import pandas as pd

df1 = pd.DataFrame({
    'date': ['2021.08.01', '2021.08.02', '2021.08.03'],
    'value': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
    'date': ['2021.08.01', '2021.08.02', '2021.08.03'],
    'value': ['X', 'Y', 'Z']
})
df3 = pd.DataFrame({
    'date': ['2021.08.01', '2021.08.02', '2021.08.03'],
    'value': ['1', '2', '3']
})

df_from_each_file = [df1, df2, df3]

# move the date column into the index so concat aligns on it
for df in df_from_each_file:
    df.index = df['date']
    df.drop('date', axis=1, inplace=True)

result = pd.concat(df_from_each_file, axis=1)
#result = result.sort_index()
result = result.reset_index()
print(result)
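Applied back to the original CSV loop, a minimal sketch of the same idea (assuming the date column is the first column of every file; concat outer-joins on the index, so any missing dates become NaN):

import glob
import os

import pandas as pd

path = '/Users/parkerbolstad/Downloads/'
all_files = glob.glob(os.path.join(path, "*.csv"))

# index_col=0 turns the first (date) column into the index,
# so concat aligns rows on dates instead of duplicating the column
df_from_each_file = (pd.read_csv(f, index_col=0) for f in all_files)
df_merged = pd.concat(df_from_each_file, axis=1)
df_merged.to_csv("merged.csv")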
Problem:
Given a large data set (3 million rows x 6 columns), what's the fastest way to join values of columns in a single pandas data frame, based on the rows where the mask is true?
My current solution:
import pandas as pd
import numpy as np

# Note: real data will be 3 million rows x 6 columns
df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
#print(df)

msg_text_filter = ['msg0', 'msg2']
columns = df.columns.drop(df.columns[:3])
column_join = ["d0"]
mask = df['msg'].isin(msg_text_filter)

df.replace(np.nan, '', inplace=True)

# THIS IS SLOW, HOW TO SPEED UP?
df['d0'] = np.where(
    mask,
    df[['d0', 'd1', 'd2']].agg(''.join, axis=1),
    df['d0']
)
df.loc[mask, columns] = np.nan
print(df)
IMHO you can save a lot of time by using
df[['d0', 'd1', 'd2']].sum(axis=1)
instead of
df[['d0', 'd1', 'd2']].agg(''.join, axis=1)
because summing string columns concatenates them in vectorized fashion, while agg calls ''.join once per row in a Python-level loop. And I think instead of using np.where you could just do:
df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
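Putting both suggestions together, a sketch against the toy frame above (df, mask, and columns as defined in the question):

# vectorized string concatenation on the masked rows only
df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
# blank out the joined source columns, as in the original code
df.loc[mask, columns] = np.nan
print(df)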
I have a dataframe df (defined in the code below) which I need to merge with N dataframes.
In this post, for the sake of clarity, N=3.
The goal is to check whether every value of the column Id exists in the three other dataframes, with the same Value as well. If so, the row has to be highlighted in green. That's it!
Code:
import pandas as pd
import numpy as np

### --- Dataframes
df = pd.DataFrame({'Id': ['AA', 'BB', 'CC', 'DD', 'EE'],
                   'Value': ['three', 'two', 'five', 'four', 'one']})
df1 = pd.DataFrame({'Id1': [np.nan, 'CC', 'BB', 'DD', np.nan],
                    'Value1': ['one', 'four', 'two', np.nan, np.nan]})
df2 = pd.DataFrame({'Id2': ['AA', 'BB', 'CC', 'DD', 'JJ'],
                    'Value2': [np.nan, 'two', 'five', np.nan, 'six']})
df3 = pd.DataFrame({'Id3': ['FF', 'HH', 'CC', 'GG', 'BB'],
                    'Value3': ['seven', 'five', 'one', 'three', 'two']})

### --- Joining df to df1, df2 and df3
df_df1 = df.merge(df1, left_on='Id', right_on='Id1', how='left')
df_df1_df2 = df_df1.merge(df2, left_on='Id', right_on='Id2', how='left')
df_df1_df2_df3 = df_df1_df2.merge(df3, left_on='Id', right_on='Id3', how='left')

### --- Creating a function to highlight the aligned rows
def highlight_aligned_row(x):
    m1 = (x['Id'] == x['Id1']) & (x['Id'] == x['Id2']) & (x['Id'] == x['Id3'])
    m2 = (x['Value'] == x['Value1']) & (x['Value'] == x['Value2']) & (x['Value'] == x['Value3'])
    df = pd.DataFrame('background-color: ', index=x.index, columns=x.columns)
    df['Id'] = np.where(m1 & m2, 'background-color: green', df['Id'])
    return df

>>> df_df1_df2_df3.style.apply(highlight_aligned_row, axis=None)
My questions are:
How do we highlight the entire row when a condition is fulfilled?
Is there a more efficient way to merge 10 dataframes?
How can we check that every value/row of the original dataframe is aligned with the values of the final dataframe (after the merge)?
Thank you in advance for your suggestions and your help!
I would do it like this. Hope the comments in between make clear what I am doing. Hopefully, they also answer your questions, but let me know if anything remains unclear.
import pandas as pd
import numpy as np

df = pd.DataFrame({'Id': ['AA', 'BB', 'CC', 'DD', 'EE'],
                   'Value': ['three', 'two', 'five', 'four', 'one']})
df1 = pd.DataFrame({'Id1': [np.nan, 'BB', 'CC', 'DD', np.nan],
                    'Value1': ['one', 'two', 'four', np.nan, np.nan]})
df2 = pd.DataFrame({'Id2': ['AA', 'BB', 'CC', 'DD', 'JJ'],
                    'Value2': [np.nan, 'two', 'five', np.nan, 'six']})
df3 = pd.DataFrame({'Id3': ['FF', 'BB', 'CC', 'GG', 'HH'],
                    'Value3': ['seven', 'two', 'one', 'v4', 'v5']})

# *IF* your dfs (like above) are all same shape with same index, then easiest is to
# collect your dfs in a list and use pd.concat along axis 1 to merge:
# dfs = [df, df1, df2, df3]
# df_all = pd.concat(dfs, axis=1)

# *BUT* from your comments, this does not appear to be the case. Then instead,
# again collect dfs in a list, and merge them with df in a loop
dfs = [df, df1, df2, df3]
for idx, list_df in enumerate(dfs):
    if idx == 0:
        df_all = list_df
    else:
        df_all = df_all.merge(list_df, left_on='Id',
                              right_on=[col for col in list_df.columns
                                        if col.startswith('Id')][0],
                              how='left')

def highlight_aligned_row(x):
    n1 = x.loc[:, [col for col in x.columns
                   if col.startswith('Id')]].eq(x.loc[:, 'Id'], axis=0).all(axis=1)
    m1 = x.loc[:, [col for col in x.columns
                   if col.startswith('Value')]].eq(x.loc[:, 'Value'], axis=0).all(axis=1)
    eval_bool = n1 & m1
    # Just for x['Id']: [False, True, False, False, False].
    # Repeating 8 times (== len(df.columns)) leads to .shape == (40,).
    # Reshape to 5 rows (== len(df)) and 8 cols. The second row is now 8x True,
    # all other rows 8x False.
    rows, cols = len(eval_bool), len(x.columns)  # 5, 8
    eval_bool_repeated = eval_bool.to_numpy().repeat(cols).reshape(rows, cols)
    # set up the frame of style strings
    df = pd.DataFrame('background-color: ', index=x.index, columns=x.columns)
    # now apply eval_bool_repeated to the entire df, not just df['Id']
    df = np.where(eval_bool_repeated, 'background-color: green', df)
    return df
Result: in the rendered Styler output, the row where all Id and Value columns agree (the BB row here) is highlighted green.
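For completeness, a minimal sketch of applying and exporting the styles (assuming a Jupyter notebook for inline rendering; Styler.to_html needs a reasonably recent pandas):

styled = df_all.style.apply(highlight_aligned_row, axis=None)
styled  # a Styler renders inline in a notebook

# or export the highlighted table as HTML:
with open('highlighted.html', 'w') as f:
    f.write(styled.to_html())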
The wide_to_long call below doesn't work. Can anyone help? Thanks!
import pandas as pd

ori_df = pd.DataFrame([['a', '1'], ['w:', 'z'], ['t', '6'], ['f:', 'z'], ['a', '2']],
                      columns=['type', 'value'])
ori_df['id'] = ori_df.index
pd.wide_to_long(ori_df, ['type', 'value'], i='id', j='amount')
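For reference, a likely cause: wide_to_long matches columns named <stub><suffix> (numeric suffixes by default), so plain type/value columns match nothing and the call comes back empty. A minimal sketch of the column shape it expects, with hypothetical suffixed data:

import pandas as pd

# type1/value1 and type2/value2 share the stubs 'type' and 'value'
wide = pd.DataFrame({'id': [0, 1],
                     'type1': ['a', 't'], 'value1': ['1', '6'],
                     'type2': ['w:', 'f:'], 'value2': ['z', 'z']})

# the numeric suffix (1, 2) becomes the new 'amount' index level
print(pd.wide_to_long(wide, ['type', 'value'], i='id', j='amount'))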
I have a df after read_excel where some values (strings, from one column) end up split across several rows. How can I merge them back?
For example:
The df I have:
{'CODE': ['A', None, 'B', None, None, 'C'],
 'TEXT': ['A', 'a', 'B', 'b', 'b', 'C'],
 'NUMBER': ['1', None, '2', None, None, '3']}
The df I want:
{'CODE': ['A', 'B', 'C'],
 'TEXT': ['Aa', 'Bbb', 'C'],
 'NUMBER': ['1', '2', '3']}
I can't find the right solution. I tried importing the data in different ways, but that did not help either.
You can forward fill the missing values (Nones) in CODE to form groups, then aggregate TEXT with a string join and take the first non-None value for the NUMBER column:
import pandas as pd

d = {'CODE': ['A', None, 'B', None, None, 'C'],
     'TEXT': ['A', 'a', 'B', 'b', 'b', 'C'],
     'NUMBER': ['1', None, '2', None, None, '3']}
df = pd.DataFrame(d)

# group by the forward-filled CODE, join the TEXT pieces, keep the first NUMBER
df1 = df.groupby(df['CODE'].ffill()).agg({'TEXT': ''.join, 'NUMBER': 'first'}).reset_index()
print(df1)
CODE TEXT NUMBER
0 A Aa 1
1 B Bbb 2
2 C C 3
If there are many columns, you can generate the aggregation dictionary instead of spelling it out:
cols = df.columns.difference(['CODE'])
d1 = dict.fromkeys(cols, 'first')
d1['TEXT'] = ''.join
df1 = df.groupby(df['CODE'].ffill()).agg(d1).reset_index()
I have two dataframes, df1 and df2:
df1 = pd.DataFrame({'name': ['A', 'B', 'C'],
                    'value': [100, 300, 150]})
df2 = pd.DataFrame({'name': ['A', 'B', 'D'],
                    'value': [20, 50, 7]})
I want to merge these two dataframes into a new dataframe df3 that stacks the rows of both.
Then I want a fourth dataframe df4 where the rows are aggregated to sums, like:
df4 = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                    'value': [120, 350, 150, 7]})
How to do this?
You can concatenate the DataFrames together then use a groupby and sum:
df3 = pd.concat([df1, df2])
df4 = df3.groupby('name').sum().reset_index()
Result of df4:
name value
0 A 120
1 B 350
2 C 150
3 D 7
Another way is just append (note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0):
df1.append(df2, ignore_index=True).groupby('name')['value'].sum().to_frame()
value
name
A 120
B 350
C 150
D 7
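On pandas 2.0 and later, where append is gone, the equivalent one-liner uses pd.concat:

pd.concat([df1, df2], ignore_index=True).groupby('name')['value'].sum().to_frame()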