I have a data set like this:
data = ({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]})
Where Dan in A has the corresponding number 3 in B, and where Dan in C has the corresponding number 6 in D.
I would like to create 2 new columns, one with the name Dan and the other with 9 (3+6).
Desired Output
data = ({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12], 'E': ['Dan', 'Tom', 'Mary'], 'F': [9, 7, 9], 'G': ['John', 'Mike'], 'H': [1, 12]})
For names, John and Mike 2 different columns with their values unchanged.
I have tried using some for loops and .loc, but I am not anywhere close.
Thanks!
df = data[['A','B']]
_df = data[['C','D']]
_df.columns = ['A','B']
df = pd.concat([df,_df]).groupby(['A'],as_index=False)['B'].sum().reset_index()
df.columns = ['E','F']
data = data.merge(df,how='left',left_on=['A'],right_on=['E'])
Although you can join on column C too, that's something you have choose. Or alternatively if you want just columns E & F, then skip the last line!
You can try this:
import pandas as pd
data = {'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]}
df=pd.DataFrame(data)
df=df.rename(columns={"C": "A", "D": "B"})
df=df.stack().reset_index(0, drop=True).rename_axis("index").reset_index()
df=df.pivot(index=df.index//2, columns="index")
df.columns=map(lambda x: x[1], df.columns)
df=df.groupby("A", as_index=False).sum()
Outputs:
>>> df
A B
0 Dan 9
1 John 1
2 Mary 9
3 Mike 12
4 Tom 7
Related
I am very new to Python. I want to compare two dataframes. They both have the same columns, first column is the key variable (ID). My goal is to print the differences.
For example:
import pandas as pd
import numpy as np
dframe1 = {'ID': [1, 2, 3, 4, 5], 'Apple': ['C', 'B', 'C', 'A', 'E'], 'Pear': [2, 3, 5, 6, 7]}
dframe2 = {'ID': [4, 2, 1, 3], 'Apple': ['A', 'C', 'C', 'C'], 'Pear': [6, 'NA', 'NA', 5]}
df1 = pd.DataFrame(dframe1)
df2 = pd.DataFrame(dframe2)
import datacompy
compare=datacompy.Compare(
df1,
df2,
df1_name='Reference',
df2_name='Test',
on_index=True
)
print(compare.report())
This produces a comparison report but I want my output to be like the following. Columns of my desired output:
out1 = {'var.x': ['Apple', 'Pear', 'Pear'], 'var.Y': ['Apple', 'Pear', 'Pear'], 'ID': [2, 1, 2],'values.x': ['B', '2', '3'], 'values.Y': ['C','NA','NA'],'row.x': [2, 1, 4], 'row.y': [2, 3, 1]}
outp = pd.DataFrame(out1)
print(outp)
Thanks a lot for your support.
I have a big pandas dataframe (about 150000 rows). I have tried method groupby('id') but in returns group tuples. I need just a list of dataframes, and then I convert them into np array batches to put into an autoencoder (like this https://www.datacamp.com/community/tutorials/autoencoder-keras-tutorial but 1D)
So I have a pandas dataset :
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John', 'John', 'John', 'John', 'Krish'], 'Age': [20, 21, 19, 18, 18, 18, 18, 18],'id': [1, 1, 2, 2, 3, 3, 3, 3]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df.head(10)
I need the same output (just a list of pandas dataframe). Also, i need a list of unsorted lists, it is important, because its time series.
data1 = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21],'id': [1, 1]}
data2 = {'Name': ['Krish', 'John', ], 'Age': [19, 18, ],'id': [2, 2]}
data3 = {'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18],'id': [3, 3, 3, 3]}
pd_1 = pd.DataFrame(data1)
pd_2 = pd.DataFrame(data2)
pd_3 = pd.DataFrame(data3)
array_list = [pd_1,pd_2,pd_3]
array_list
How can I split dataframe ?
Or you can TRY:
array_list = df.groupby(df.id.values).agg(list).to_dict('records')
Output:
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'],
'Age': [18, 18, 18, 18],
'id': [3, 3, 3, 3]}]
UPDATE:
If you need a dataframe list:
df_list = [g for _,g in df.groupby('id')]
#OR
df_list = [pd.DataFrame(i) for i in df.groupby(df.id.values).agg(list).to_dict('records')]
To reset the index of each dataframe:
df_list = [g.reset_index(drop=True) for _,g in df.groupby('id')]
Let us group on id and using to_dict with orientation list prepare records per id
[g.to_dict('list') for _, g in df.groupby('id', sort=False)]
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18], 'id': [3, 3, 3, 3]}]
I am not sure about your need but does something like this works for you?
df = df.set_index("id")
[df.loc[i].to_dict("list") for i in df.index.unique()]
or if you really want to keep your index in your list:
[df.query(f"id == {i}").to_dict("list") for i in df.id.unique()]
If you want to create new DataFrames storing the values:
(Previous answers are more relevant if you want to create a list)
This can be solved by iterating over each id using a for loop and create a new dataframe every loop.
I refer you to #40498463 and the other answers for the usage of the groupby() function. Please note that I have changed the name of the id column to Id.
for Id, df in df.groupby("Id"):
str1 = "df"
str2 = str(Id)
new_name = str1 + str2
exec('{} = pd.DataFrame(df)'.format(new_name))
Output:
df1
Name Age Id
0 Tom 20 1
1 Joseph 21 1
df2
Name Age Id
2 Krish 19 2
3 John 18 2
df3
Name Age Id
4 John 18 3
5 John 18 3
6 John 18 3
7 Krish 18 3
I have the following dataframe:
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
'Name': ['name1', 'name2', 'name3', 'name4', 'name5', 'name6',
'name7', 'name8', 'name9', 'name10', 'name11', 'name12'],
'Category': ['A', 'A/B', 'B/C', 'A/B/C', 'A/B/C', 'B/C',
'A/B', 'A/B/C', 'A/B/C', 'B', 'C', 'A/C']})
I need to get a number of dataframes for each category. For instance, as output for category A:
df_a = pd.DataFrame({'ID': [1, 2, 4, 5, 7, 8, 9, 12],
'Name': ['name1', 'name2', 'name4', 'name5',
'name7', 'name8', 'name9', 'name12'],
'Category': ['A', 'A', 'A', 'A',
'A', 'A', 'A', 'A']})
Let's split the categories, explode the data frame and groupby:
df_dicts = {k:v for k,v in (df.assign(Category=df['Category'].str.split('/'))
.explode('Category')
.groupby('Category')
)
}
And you get, for example df_dicts['A']:
ID Name Category
0 1 name1 A
1 2 name2 A
3 4 name4 A
4 5 name5 A
6 7 name7 A
7 8 name8 A
8 9 name9 A
11 12 name12 A
Just keep rows whose category contains "A":
df_a = df[df['Category'].str.contains('A')]
I'm new to python can anyone help me with this.
For example, I have a data frame of
data = pd.DataFrame({'a': [1,1,2,2,2,3], 'b': [12,22,23,34,44,55], 'c'['a','','','','c',''], 'd':['','b','b','a','a','']})
I want to sum a and ignore the different in b
data = ({'a':[1,2,3],'c':['a','c',''],'d':['b','baa','']})
How can I do this?
You Question is bit difficult to under stand but if I guess correctly, this can be solution.
data = {'a': [1,1,2,2,2,3], 'b': [12,22,23,34,44,55], 'c':['a','','','','c',''], 'd':['','b','b','a','a','']}
new_data = {k: list(set(v)) for k, v in data.items()}
{'a': [1, 2, 3],
'b': [34, 12, 44, 55, 22, 23],
'c': ['', 'a', 'c'],
'd': ['', 'a', 'b']}
You need groupby + agg
data.groupby('a').agg({'b':'sum','c' : lambda x : ''.join(x),'d' : lambda x : ''.join(x)}).reset_index()
Out[54]:
a d c b
0 1 b a 34
1 2 baa c 101
2 3 55
I am trying to convert a dataframe to dictionary:
xtest_cat = xtest_cat.T.to_dict().values()
but it gives a warning :
Warning: DataFrame columns are not unique, some columns will be omitted python
I checked the columns names of the dataframe(xtest_cat) :
len(list(xtest_cat.columns.values))
len(set(list(xtest_cat.columns.values)))
they are all unique.
Can anyone help me out ?
You can use reset_index for create unique index:
xtest_cat = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
xtest_cat.index = [0,1,1]
print (xtest_cat)
A B C
0 1 4 7
1 2 5 8
1 3 6 9
print (xtest_cat.index.is_unique)
False
xtest_cat.reset_index(drop=True, inplace=True)
print (xtest_cat)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
xtest_cat = xtest_cat.T.to_dict().values()
print (xtest_cat)
dict_values([{'B': 4, 'C': 7, 'A': 1}, {'B': 5, 'C': 8, 'A': 2}, {'B': 6, 'C': 9, 'A': 3}])
You can also omit T and add parameter orient='index':
xtest_cat = xtest_cat.to_dict(orient='index').values()
print (xtest_cat)
dict_values([{'B': 4, 'C': 7, 'A': 1}, {'B': 5, 'C': 8, 'A': 2}, {'B': 6, 'C': 9, 'A': 3}])
orient='record' is better:
xtest_cat = xtest_cat.to_dict(orient='records')
print (xtest_cat)
[{'B': 4, 'C': 7, 'A': 1}, {'B': 5, 'C': 8, 'A': 2}, {'B': 6, 'C': 9, 'A': 3}]