Pandas groupby and sum to other dataframe - python

I have a dictionary where each key is a file name and each value is a dataframe that looks like:
col1 col2
A 10
B 20
A 20
A 10
B 10
I want to group by 'col1', sum the values in 'col2', and store the result for each file in a new dataframe 'df'. The output should look like:
Index A B
file1 40 30
file2 50 35
My code:
df = pd.DataFrame(columns=['A', 'B'])
for key, value in data.items():
    cnt = value.groupby('col1')['col2'].sum()
    print(cnt)
    df.append(cnt, ignore_index=True)

One suggested way: group by, transpose, and row-stack the results into a dataframe.
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'col1':['A', 'B', 'A', 'A', 'B'], 'col2':[10, 20, 20, 10, 10]})
df_2 = pd.DataFrame({'col1':['A', 'B', 'A', 'A', 'B'], 'col2':[30, 10, 15, 5, 25]})
df_1_agg = df_1.groupby(['col1']).agg({'col2':'sum'}).T.values
df_2_agg = df_2.groupby(['col1']).agg({'col2':'sum'}).T.values
pd.DataFrame(np.row_stack((df_1_agg, df_2_agg)), index = ['file1', 'file2']).rename(columns = {0:'A', 1:'B'})
Edited: to generalize, put the aggregation into a loop over the dataframes. You also need to build the index (file{i}) programmatically for the general case; a sketch of this follows the loop below.
lst_df = [df_1, df_2]
df_all = []
for i in lst_df:
    # iterate over every dataframe
    df_agg = i.groupby(['col1']).agg({'col2': 'sum'}).T.values
    # append to the accumulator
    df_all.append(df_agg)
pd.DataFrame(np.row_stack(df_all), index=['file1', 'file2']).rename(columns={0: 'A', 1: 'B'})
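A sketch of the generalized 'file{i}' index mentioned above, continuing with df_1 and df_2 from the previous block (the simple sequential numbering is an assumption):
lst_df = [df_1, df_2]
df_all = [d.groupby(['col1']).agg({'col2': 'sum'}).T.values for d in lst_df]
# Build the index from the number of frames instead of hardcoding it
idx = ['file{}'.format(i + 1) for i in range(len(lst_df))]
pd.DataFrame(np.row_stack(df_all), index=idx).rename(columns={0: 'A', 1: 'B'})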

You should try to avoid appending in a loop. This is inefficient and not recommended.
Instead, you can concatenate your dataframes into one large dataframe, then use pivot_table:
# aggregate values in your dictionary, adding a "file" series
df_comb = pd.concat((v.assign(file=k) for k, v in data.items()), ignore_index=True)
# perform 'sum' aggregation, specifying index, columns & values
df = df_comb.pivot_table(index='file', columns='col1', values='col2', aggfunc='sum')
Explanation
v.assign(file=k) adds a column file to each dataframe, with its value set to the filename.
pd.concat concatenates all the dataframes in your dictionary.
pd.DataFrame.pivot_table is a Pandas method which lets you create Excel-style pivot tables by specifying index, columns, values and aggfunc (aggregation function).
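For example, with two small frames shaped like the question's data (the file names here are illustrative), this produces the requested layout:
import pandas as pd

data = {
    'file1': pd.DataFrame({'col1': ['A', 'B', 'A', 'A', 'B'],
                           'col2': [10, 20, 20, 10, 10]}),
    'file2': pd.DataFrame({'col1': ['A', 'B', 'A', 'A', 'B'],
                           'col2': [30, 10, 15, 5, 25]}),
}

df_comb = pd.concat((v.assign(file=k) for k, v in data.items()), ignore_index=True)
df = df_comb.pivot_table(index='file', columns='col1', values='col2', aggfunc='sum')
print(df)
# col1    A   B
# file
# file1  40  30
# file2  50  35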

Related

Populating a column based off of values in another column

Hi, I am working with pandas to manipulate some lab data. I currently have a data frame with 5 columns.
The first three columns (Analyte, CAS NO(1), and Value) are in the correct order.
The last two columns (CAS NO(2) and Value(2)) are not.
Is there a way to align CAS NO(2) and Value(2) with the first three columns based on matching CAS numbers (i.e. CAS NO(2) = CAS NO(1))?
I am new to Python and pandas. Thank you for your help.
You can reorder the columns by reassigning df to a slice of itself, indexed with a list whose entries are the column names in the desired order:
colidx = ['Analyte', 'CAS NO(1)', 'CAS NO(2)']
df = df[colidx]
It would be better to provide the input data in text format so we can copy-paste it. I understand your question like this: you need to sort the two last columns together, so that CAS NO(2) matches CAS NO(1).
Since CAS NO(2) = CAS NO(1), you then do not need a duplicated CAS NO(2) column, right?
Split off the two last columns and make a Series from them, then convert that Series to a dict and use that dict to map the new values.
# Split off the two last columns and index them by the CAS number.
df_tmp = df[['CAS NO(2)', 'Value(2)']]
df_tmp = df_tmp.set_index('CAS NO(2)')
# Keep only the first 3 columns of the original dataframe
df = df[['Analyte', 'CASNo(1)', 'Value(1)']]
# Now copy CASNo(1) to CAS NO(2)
df['CAS NO(2)'] = df['CASNo(1)']
# Now create the Value(2) column by mapping on the CAS number
df['Value(2)'] = df['CASNo(1)'].map(df_tmp.to_dict()['Value(2)'])
Try the following:
import pandas as pd
import numpy as np
#create an example of your table
list_CASNo1 = ['71-43-2', '100-41-4', np.nan, '1634-04-4']
list_Val1 = [np.nan]*len(list_CASNo1)
list_CASNo2 = [np.nan, np.nan, np.nan, '100-41-4']
list_Val2 = [np.nan, np.nan, np.nan, '18']
df = pd.DataFrame(list(zip(list_CASNo1, list_Val1, list_CASNo2, list_Val2)), columns=['CASNo(1)', 'Value(1)', 'CAS NO(2)', 'Value(2)'], index=['Benzene', 'Ethylbenzene', 'Gasoline Range Organics', 'Methyl-tert-butyl ether'])
#split the data to two dataframes
df1 = df[['CASNo(1)','Value(1)']]
df2 = df[['CAS NO(2)','Value(2)']]
#merge df2 to df1 based on the specified columns
#reset_index and set_index will take care
#that df_adjusted will have the same index names as df1
df_adjusted = df1.reset_index().merge(df2.dropna(),
                                      how='left',
                                      left_on='CASNo(1)',
                                      right_on='CAS NO(2)').set_index('index')
But be careful with duplicate CAS numbers in the key columns: they will duplicate rows in the merge and break the result.

Add column containing list value if column contains string in list

I'm trying to scan a particular column in a dataframe, e.g. df['x'], for values that I have in a separate list, list = ['y', 'z', 'a', 'b']. How do I make pandas add a new column containing the list value if df['x'] contains any one, or more than one, of the values from the list?
Thanks!
Use this (avoid naming your list list, which shadows the built-in; call it lst instead):
import pandas as pd

lst = ['y', 'z', 'a', 'b']
if df['x'].str.contains('|'.join(lst)).any():
    df = pd.concat([df, pd.DataFrame(lst)], axis=1)
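If instead the goal is a per-row column listing which of the keywords actually appear in df['x'], one possible sketch (the column name and sample data are illustrative):
import pandas as pd

df = pd.DataFrame({'x': ['foo y', 'bar', 'z and a']})
lst = ['y', 'z', 'a', 'b']

# For each row, collect the keywords found in that row's text (empty list if none match)
df['matches'] = df['x'].apply(lambda s: [k for k in lst if k in s])
print(df)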

Pandas, groupby() function that separates out by "," but retains datatype?

Is there a way to use the .groupby() function to consolidate repeating rows in a data frame, separating out non-similar elements with a ',', while having the resulting data frame retain the original datatype of the non-similar elements (or convert non-similar items to objects)?
As I understand, a column in pandas can hold multiple datatypes, so I feel like this should be possible.
I can use the .agg() function to separate out non-similar elements by a ',', but it doesn't work with non-string elements. I'd like to separate out the datatypes for error checking later when looking for rows with bad entries after the .groupby().
#Libraries
import pandas as pd
import numpy as np
#Example dataframe
col = ['Acol', 'Bcol', 'Ccol', 'Dcol']
df = pd.DataFrame(columns = col)
df['Acol'] = [1,1,2,3]
df['Bcol'] = ['a', 'b', 'c', 'd']
df['Ccol'] = [1,2,3,4]
df['Dcol'] = [1,'a',2,['a', 'b']]
#Code
outdf = df.groupby(by='Acol').agg(lambda x: ','.join(x)).reset_index()
outdf
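One possible direction (a sketch, not a verified solution): aggregate each group into a Python list instead of a joined string, so the original datatypes survive inside the cell:
import pandas as pd

df = pd.DataFrame({
    'Acol': [1, 1, 2, 3],
    'Bcol': ['a', 'b', 'c', 'd'],
    'Ccol': [1, 2, 3, 4],
    'Dcol': [1, 'a', 2, ['a', 'b']],
})

# list() accepts any element type, so ints, strings and nested lists all survive
outdf = df.groupby('Acol').agg(list).reset_index()
print(outdf)
print(outdf.loc[0, 'Ccol'])  # [1, 2] -- still integers, not a joined string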

Python csv Merge several column to one cell

How can I merge several columns into one cell?
How can I convert a CSV file that includes 1-by-X cells, where 1 is the row count and X is a column count unknown to the user, into a new CSV file with a single cell combining all the data from the original file?
Right now (one row, four columns; in fact the number of columns is variable, as the data is extracted from a log file):
1 A B C D
What I want (one row, one column):
1 A
B
C
D
The row index will not always be 1, as I have many similar rows like this.
Please refer to the original file and the expected new file at the following URL for details:
https://drive.google.com/drive/folders/1cX0o86nbLAV5Foj5avCK6E2oAvExzt5t
161.csv is the current file
161-final.csv is what I want...
The number of rows will not change. However, the number of columns is variable, as the data is extracted from a log file. In the end, each row should have only one column.
I am new to pandas. Would one way be to calculate the number of columns and then merge them into one cell?
I very much appreciate your help.
code:
import pandas as pd
import numpy as np
df = pd.DataFrame([['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b']])
print(df)
df1 = np.empty(df.shape[0], dtype=object)
df1[:] = df.values.tolist()
print(df1)
output:
0 1 2 3
0 a a a a
1 b b b b
[list(['a', 'a', 'a', 'a']) list(['b', 'b', 'b', 'b'])]
Not sure this is what you want, but you can put the content of each row into a single column with groupby():
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(4).reshape(1, 4))
df.columns = ['A', 'B', 'C', 'D']
df = df.groupby(df.index).apply(lambda x: x.values.ravel())
df = df.to_frame('unique_col')  # if needed
Output :
unique_col
0 [0, 1, 2, 3]
Not sure it is possible to have the output not be a list, as shown in your example.
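If a single string per row is acceptable instead of a list, one possible sketch (the output file name is illustrative) is to cast everything to text and join across the columns; the column count does not need to be known in advance:
import pandas as pd

df = pd.DataFrame([['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b']])

# Join every value in a row into one newline-separated string,
# however many columns the row happens to have
merged = df.astype(str).agg('\n'.join, axis=1).to_frame('unique_col')
merged.to_csv('merged.csv')  # one column, one cell per row
print(merged)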
I used a "stupid" way to do that. It is pd.read_table(file) that does the magic.
import pandas as pd

def CsvMerge(file1, file2, output):
    ## This is preparation for getFormated():
    ## merge the "Format.csv" and the "text.csv.csv"
    df1 = pd.read_csv(file1)
    ## read_table merges the columns into one column,
    ## which also avoids a csv error
    df2 = pd.read_table(file2)
    df3 = pd.concat([df1, df2], axis=1)
    print(df3)
    with open(output, 'w+', encoding="utf-8") as f:
        df3.to_csv(f, index=False)

How to coerce pandas dataframe column to be normal index

I create a DataFrame from a dictionary. I want the keys to be used as index and the values as a single column. This is what I managed to do so far:
import pandas as pd
my_counts = {"A": 43, "B": 42}
df = pd.DataFrame(pd.Series(my_counts, name=("count",)).rename_axis("letter"))
I get the following:
count
letter
A 43
B 42
The problem is I want to concatenate this (with pd.concat) with other dataframes that have the same index name (letter) and seemingly the same single column (count), but I end up with an
AssertionError: invalid dtype determination in get_concat_dtype.
I discovered that the other dataframes have a different type for their columns: Index(['count'], dtype='object'). The above dataframe has MultiIndex(levels=[['count']], labels=[[0]]).
How can I ensure my dataframe has a normal index?
You can prevent the MultiIndex columns by eliminating the trailing ',' in the name:
df = pd.DataFrame(pd.Series(my_counts, name=("count")).rename_axis("letter"))
df.columns
Output:
Index(['count'], dtype='object')
Or you can flatten your MultiIndex columns like this:
df = pd.DataFrame(pd.Series(my_counts, name=("count",)).rename_axis("letter"))
df.columns = df.columns.map(''.join)
df.columns
Output:
Index(['count'], dtype='object')
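As a quick check (the second dataframe here is made up for illustration), concatenation works once both frames carry a flat column index:
import pandas as pd

my_counts = {"A": 43, "B": 42}
other_counts = {"C": 7, "D": 1}

df = pd.DataFrame(pd.Series(my_counts, name="count").rename_axis("letter"))
other = pd.DataFrame(pd.Series(other_counts, name="count").rename_axis("letter"))

# Both column indexes are now Index(['count'], dtype='object'), so this concatenates cleanly
combined = pd.concat([df, other])
print(combined)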
