Pandas groupby and sum to other dataframe - python

I have a dictionary where each key is a file name and each value is a dataframe that looks like:
col1 col2
A 10
B 20
A 20
A 10
B 10
I want to group by 'col1', sum the values in 'col2', and store the result for each file in a new dataframe 'df'. The output should look like:
Index A B
file1 40 30
file2 50 35
My code:
df = pd.DataFrame(columns=['A', 'B'])
for key, value in data.items():
    cnt = value.groupby('col1')['col2'].sum()
    print(cnt)
    df.append(cnt, ignore_index=True)

One suggested way: group by, transpose, and row-stack the results into a dataframe.
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'col1':['A', 'B', 'A', 'A', 'B'], 'col2':[10, 20, 20, 10, 10]})
df_2 = pd.DataFrame({'col1':['A', 'B', 'A', 'A', 'B'], 'col2':[30, 10, 15, 5, 25]})
df_1_agg = df_1.groupby(['col1']).agg({'col2':'sum'}).T.values
df_2_agg = df_2.groupby(['col1']).agg({'col2':'sum'}).T.values
pd.DataFrame(np.row_stack((df_1_agg, df_2_agg)), index = ['file1', 'file2']).rename(columns = {0:'A', 1:'B'})
Edited: to generalize, put the aggregation into a loop over the dataframes. You also need to build the index (file{i}) programmatically for the general case; a sketch of this follows the loop below.
lst_df = [df_1, df_2]
df_all = []
for i in lst_df:
    # iterate over every dataframe
    df_agg = i.groupby(['col1']).agg({'col2': 'sum'}).T.values
    # append to the accumulator
    df_all.append(df_agg)
pd.DataFrame(np.row_stack(df_all), index=['file1', 'file2']).rename(columns={0: 'A', 1: 'B'})
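A sketch of the generalized 'file{i}' index mentioned above, continuing with df_1 and df_2 from the previous block (the simple sequential numbering is an assumption):
lst_df = [df_1, df_2]
df_all = [d.groupby(['col1']).agg({'col2': 'sum'}).T.values for d in lst_df]
# Build the index from the number of frames instead of hardcoding it
idx = ['file{}'.format(i + 1) for i in range(len(lst_df))]
pd.DataFrame(np.row_stack(df_all), index=idx).rename(columns={0: 'A', 1: 'B'})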

You should try to avoid appending in a loop. This is inefficient and not recommended.
Instead, you can concatenate your dataframes into one large dataframe, then use pivot_table:
# aggregate values in your dictionary, adding a "file" series
df_comb = pd.concat((v.assign(file=k) for k, v in data.items()), ignore_index=True)
# perform 'sum' aggregation, specifying index, columns & values
df = df_comb.pivot_table(index='file', columns='col1', values='col2', aggfunc='sum')
Explanation
v.assign(file=k) adds a column file to each dataframe, with its value set to the filename.
pd.concat concatenates all the dataframes in your dictionary.
pd.DataFrame.pivot_table is a Pandas method which lets you create Excel-style pivot tables by specifying index, columns, values and aggfunc (aggregation function).
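For example, with two small frames shaped like the question's data (the file names here are illustrative), this produces the requested layout:
import pandas as pd

data = {
    'file1': pd.DataFrame({'col1': ['A', 'B', 'A', 'A', 'B'],
                           'col2': [10, 20, 20, 10, 10]}),
    'file2': pd.DataFrame({'col1': ['A', 'B', 'A', 'A', 'B'],
                           'col2': [30, 10, 15, 5, 25]}),
}

df_comb = pd.concat((v.assign(file=k) for k, v in data.items()), ignore_index=True)
df = df_comb.pivot_table(index='file', columns='col1', values='col2', aggfunc='sum')
print(df)
# col1    A   B
# file
# file1  40  30
# file2  50  35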

Related

Populating a column based off of values in another column

Hi, I am working with pandas to manipulate some lab data. I currently have a data frame with 5 columns.
The first three columns (Analyte, CAS NO(1), and Value) are in the correct order.
The last two columns (CAS NO(2) and Value(2)) are not.
Is there a way to align CAS NO(2) and Value(2) with the first three columns based on matching CAS numbers (i.e. CAS NO(2) = CAS NO(1))?
I am new to Python and pandas. Thank you for your help.
You can reorder the columns by reassigning df to a slice of itself, indexed with a list whose entries are the column names in the desired order:
colidx = ['Analyte', 'CAS NO(1)', 'CAS NO(2)']
df = df[colidx]
It would be better to provide the input data in text format so we can copy-paste it. I understand your question like this: you need to sort the two last columns together, so that CAS NO(2) matches CAS NO(1).
Since CAS NO(2) = CAS NO(1), you then do not need a duplicated CAS NO(2) column, right?
Split off the two last columns and make a Series from them, then convert that Series to a dict and use that dict to map the new values.
# Split off the two last columns and index them by the CAS number.
df_tmp = df[['CAS NO(2)', 'Value(2)']]
df_tmp = df_tmp.set_index('CAS NO(2)')
# Keep only the first 3 columns of the original dataframe
df = df[['Analyte', 'CASNo(1)', 'Value(1)']]
# Now copy CASNo(1) to CAS NO(2)
df['CAS NO(2)'] = df['CASNo(1)']
# Now create the Value(2) column by mapping on the CAS number
df['Value(2)'] = df['CASNo(1)'].map(df_tmp.to_dict()['Value(2)'])
Try the following:
import pandas as pd
import numpy as np
#create an example of your table
list_CASNo1 = ['71-43-2', '100-41-4', np.nan, '1634-04-4']
list_Val1 = [np.nan]*len(list_CASNo1)
list_CASNo2 = [np.nan, np.nan, np.nan, '100-41-4']
list_Val2 = [np.nan, np.nan, np.nan, '18']
df = pd.DataFrame(list(zip(list_CASNo1, list_Val1, list_CASNo2, list_Val2)), columns=['CASNo(1)', 'Value(1)', 'CAS NO(2)', 'Value(2)'], index=['Benzene', 'Ethylbenzene', 'Gasoline Range Organics', 'Methyl-tert-butyl ether'])
#split the data to two dataframes
df1 = df[['CASNo(1)','Value(1)']]
df2 = df[['CAS NO(2)','Value(2)']]
#merge df2 to df1 based on the specified columns
#reset_index and set_index will take care
#that df_adjusted will have the same index names as df1
df_adjusted = df1.reset_index().merge(df2.dropna(),
                                      how='left',
                                      left_on='CASNo(1)',
                                      right_on='CAS NO(2)').set_index('index')
But be careful with duplicate CAS numbers in the key columns: they will duplicate rows in the merge and break the result.

Add column containing list value if column contains string in list

I'm trying to scan a particular column in a dataframe, e.g. df['x'], for values that I have in a separate list, list = ['y', 'z', 'a', 'b']. How do I make pandas add a new column containing the list value if df['x'] contains any one, or more than one, of the values from the list?
Thanks!
Use this (avoid naming your list list, which shadows the built-in; call it lst instead):
import pandas as pd

lst = ['y', 'z', 'a', 'b']
if df['x'].str.contains('|'.join(lst)).any():
    df = pd.concat([df, pd.DataFrame(lst)], axis=1)
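If instead the goal is a per-row column listing which of the keywords actually appear in df['x'], one possible sketch (the column name and sample data are illustrative):
import pandas as pd

df = pd.DataFrame({'x': ['foo y', 'bar', 'z and a']})
lst = ['y', 'z', 'a', 'b']

# For each row, collect the keywords found in that row's text (empty list if none match)
df['matches'] = df['x'].apply(lambda s: [k for k in lst if k in s])
print(df)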

Pandas, groupby() function that separates out by "," but retains datatype?

Is there a way to use the .groupby() function to consolidate repeating rows in a data frame, separating out non-similar elements with a ',', while having the resulting data frame retain the original datatype of the non-similar elements (or convert non-similar items to objects)?
As I understand, a column in pandas can hold multiple datatypes, so I feel like this should be possible.
I can use the .agg() function to separate out non-similar elements by a ',', but it doesn't work with non-string elements. I'd like to separate out the datatypes for error checking later when looking for rows with bad entries after the .groupby().
#Libraries
import pandas as pd
import numpy as np
#Example dataframe
col = ['Acol', 'Bcol', 'Ccol', 'Dcol']
df = pd.DataFrame(columns = col)
df['Acol'] = [1,1,2,3]
df['Bcol'] = ['a', 'b', 'c', 'd']
df['Ccol'] = [1,2,3,4]
df['Dcol'] = [1,'a',2,['a', 'b']]
#Code
outdf = df.groupby(by='Acol').agg(lambda x: ','.join(x)).reset_index()
outdf
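One possible direction (a sketch, not a verified solution): aggregate each group into a Python list instead of a joined string, so the original datatypes survive inside the cell:
import pandas as pd

df = pd.DataFrame({
    'Acol': [1, 1, 2, 3],
    'Bcol': ['a', 'b', 'c', 'd'],
    'Ccol': [1, 2, 3, 4],
    'Dcol': [1, 'a', 2, ['a', 'b']],
})

# list() accepts any element type, so ints, strings and nested lists all survive
outdf = df.groupby('Acol').agg(list).reset_index()
print(outdf)
print(outdf.loc[0, 'Ccol'])  # [1, 2] -- still integers, not a joined string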

Python csv Merge several column to one cell

How can I merge several columns into one cell?
How can I convert a CSV file that includes 1-by-X cells, where 1 is the row count and X is a column count unknown to the user, into a new CSV file with a single cell combining all the data from the original file?
Right now (one row, four columns; in fact the number of columns is variable, as the data is extracted from a log file):
1 A B C D
What I want (one row, one column):
1 A
B
C
D
The row index will not always be 1, as I have many similar rows like this.
Please refer to the original file and the expected new file at the following URL for details:
https://drive.google.com/drive/folders/1cX0o86nbLAV5Foj5avCK6E2oAvExzt5t
161.csv is the current file
161-final.csv is what I want...
The number of rows will not change. However, the number of columns is variable, as the data is extracted from a log file. In the end, each row should have only one column.
I am new to pandas. Would one way be to calculate the number of columns and then merge them into one cell?
I very much appreciate your help.
code:
import pandas as pd
import numpy as np
df = pd.DataFrame([['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b']])
print(df)
df1 = np.empty(df.shape[0], dtype=object)
df1[:] = df.values.tolist()
print(df1)
output:
0 1 2 3
0 a a a a
1 b b b b
[list(['a', 'a', 'a', 'a']) list(['b', 'b', 'b', 'b'])]
Not sure this is what you want, but you can put the content of each row into a single column with groupby():
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(4).reshape(1, 4))
df.columns = ['A', 'B', 'C', 'D']
df = df.groupby(df.index).apply(lambda x: x.values.ravel())
df = df.to_frame('unique_col')  # if needed
Output :
unique_col
0 [0, 1, 2, 3]
Not sure it is possible to have the output not be a list, as shown in your example.
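If a single string per row is acceptable instead of a list, one possible sketch (the output file name is illustrative) is to cast everything to text and join across the columns; the column count does not need to be known in advance:
import pandas as pd

df = pd.DataFrame([['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b']])

# Join every value in a row into one newline-separated string,
# however many columns the row happens to have
merged = df.astype(str).agg('\n'.join, axis=1).to_frame('unique_col')
merged.to_csv('merged.csv')  # one column, one cell per row
print(merged)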
I used a "stupid" way to do that. It is pd.read_table(file) that does the magic.
import pandas as pd

def CsvMerge(file1, file2, output):
    ## This is preparation for getFormated():
    ## merge the "Format.csv" and the "text.csv.csv"
    df1 = pd.read_csv(file1)
    ## read_table merges the columns into one column,
    ## which also avoids a csv error
    df2 = pd.read_table(file2)
    df3 = pd.concat([df1, df2], axis=1)
    print(df3)
    with open(output, 'w+', encoding="utf-8") as f:
        df3.to_csv(f, index=False)

How to coerce pandas dataframe column to be normal index

I create a DataFrame from a dictionary. I want the keys to be used as index and the values as a single column. This is what I managed to do so far:
import pandas as pd
my_counts = {"A": 43, "B": 42}
df = pd.DataFrame(pd.Series(my_counts, name=("count",)).rename_axis("letter"))
I get the following:
count
letter
A 43
B 42
The problem is I want to concatenate this (with pd.concat) with other dataframes that have the same index name (letter) and seemingly the same single column (count), but I end up with an
AssertionError: invalid dtype determination in get_concat_dtype.
I discovered that the other dataframes have a different type for their columns: Index(['count'], dtype='object'). The above dataframe has MultiIndex(levels=[['count']], labels=[[0]]).
How can I ensure my dataframe has a normal index?
You can prevent the MultiIndex columns by eliminating the trailing ',' in the name:
df = pd.DataFrame(pd.Series(my_counts, name=("count")).rename_axis("letter"))
df.columns
Output:
Index(['count'], dtype='object')
Or you can flatten your MultiIndex columns like this:
df = pd.DataFrame(pd.Series(my_counts, name=("count",)).rename_axis("letter"))
df.columns = df.columns.map(''.join)
df.columns
Output:
Index(['count'], dtype='object')
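As a quick check (the second dataframe here is made up for illustration), concatenation works once both frames carry a flat column index:
import pandas as pd

my_counts = {"A": 43, "B": 42}
other_counts = {"C": 7, "D": 1}

df = pd.DataFrame(pd.Series(my_counts, name="count").rename_axis("letter"))
other = pd.DataFrame(pd.Series(other_counts, name="count").rename_axis("letter"))

# Both column indexes are now Index(['count'], dtype='object'), so this concatenates cleanly
combined = pd.concat([df, other])
print(combined)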
