Pandas dataframe compare same keys - python

Hi, I want to compare the same keys across the rows of a pandas DataFrame.
     car                                                            values(dict)
0  audi1                        {'colour': 'black', 'PS': '3', 'owner': 'peter'}
1  audi2        {'owner': 'fred', 'colour': 'black', 'PS': '230', 'number': '3'}
2   ford                           {'windows': '3', 'PS': '3', 'owner': 'peter'}
3    bmw  {'colour': 'black', 'windows': 'no', 'owner': 'peter', 'number': '3'}
wanted solution
       colour  owner  PS  number  windows
black       3      0   0       0        0
peter       0      3   0       0        0
3           0      0   2       2        1
fred        0      1   0       0        0
no          0      0   0       0        1
I hope my problem is understandable.
d = {'audi1': {'colour': 'black', 'PS': '3', 'owner': 'peter'}, 'audi2': {'owner': 'fred', 'colour': 'black', 'PS': '230', 'number': '3'}, 'ford': {'windows': '3', 'PS': '3', 'owner': 'peter'}, 'bmw': {'colour': 'black', 'windows': 'no', 'owner': 'peter', 'number': '3'}}
df = pd.DataFrame(d.items(), columns=['car', 'values'])

You can create a new DataFrame from the dictionaries in the values column, then stack the frame to reshape it, and finally use crosstab to create a frequency table:
s = pd.DataFrame(df['values'].tolist()).stack()
table = pd.crosstab(s, s.index.get_level_values(1))
Alternate but similar approach with groupby + value_counts followed by unstack to reshape:
s = pd.DataFrame(df['values'].tolist()).stack()
table = s.groupby(level=1).value_counts().unstack(level=0, fill_value=0)
>>> table
        PS  colour  number  owner  windows
230      1       0       0      0        0
3        2       0       2      0        1
black    0       3       0      0        0
fred     0       0       0      1        0
no       0       0       0      0        1
peter    0       0       0      3        0

Related

Merge rows if cells are equal - pandas

I have this df:
import pandas as pd
df = pd.DataFrame({'Time': ['s_1234', 's_1234', 's_1234', 's_5678', 's_8998', 's_8998'],
                   'Control': ['A', '', '', 'B', 'C', ''],
                   'tot_1': ['1', '1', '1', '1', '1', '1'],
                   'tot_2': ['2', '2', '2', '2', '2', '2']})
--------
  Time Control tot_1 tot_2
0 1234       A     1     2
1 1234       A     1     2
2 1234             1     2
3 5678       B     1     2
4 8998       C     1     2
5 8998             1     2
I would like rows with an equal Time value to be merged into one row. I would also like the tot_1 and tot_2 columns to be summed. And finally I would like to keep the Control value if one is present. Like:
  Time Control tot_1 tot_2
0 1234       A     3     6
1 5678       B     1     2
2 8998       C     2     4
Your data is different from the example df.
construct df:
import pandas as pd
df = pd.DataFrame({'Time': ['s_1234', 's_1234', 's_1234', 's_5678', 's_8998', 's_8998'],
                   'Control': ['A', '', '', 'B', 'C', ''],
                   'tot_1': ['1', '1', '1', '1', '1', '1'],
                   'tot_2': ['2', '2', '2', '2', '2', '2']})
df.Time = df.Time.str.split("_").str[1]
df = df.astype({"tot_1": int, "tot_2": int})
Group by Time and aggregate the values.
df.groupby('Time').agg({"Control": "first", "tot_1": "sum", "tot_2": "sum"}).reset_index()
  Time Control tot_1 tot_2
0 1234       A     3     6
1 5678       B     1     2
2 8998       C     2     4
EDIT for comment: Not sure if that's the best way to do it, but you could construct your agg information like this (the `|` dict merge requires Python 3.9+):
n = 2
agg_ = {"Control": "first"} | {f"tot_{i+1}": "sum" for i in range(n)}
df.groupby('Time').agg(agg_).reset_index()
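Putting the whole answer together as a runnable sketch (the dict merge is written with `**` unpacking here so it also runs before Python 3.9):

```python
import pandas as pd

df = pd.DataFrame({'Time': ['s_1234', 's_1234', 's_1234', 's_5678', 's_8998', 's_8998'],
                   'Control': ['A', '', '', 'B', 'C', ''],
                   'tot_1': ['1', '1', '1', '1', '1', '1'],
                   'tot_2': ['2', '2', '2', '2', '2', '2']})

# Strip the "s_" prefix and make the tot_* columns numeric
df['Time'] = df['Time'].str.split('_').str[1]
df = df.astype({'tot_1': int, 'tot_2': int})

# Build the aggregation spec for n tot_* columns
n = 2
agg_ = {'Control': 'first', **{f'tot_{i+1}': 'sum' for i in range(n)}}
out = df.groupby('Time').agg(agg_).reset_index()
print(out)
```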

Group 2 columns to categories based on the column values

I am new to Python and Pandas.
My DataFrame looks like this:
df = pd.DataFrame({'ID': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'],
                   'Position': ['0', '1', '2', '3', '4', '0', '1', '2', '3', '0', '1', '2', '0', '1', '2'],
                   'Brand': ['Mazda', 'BMW', 'Ford', 'Fiat', 'Dodge', 'Mazda', 'BMW', 'Ford', 'Fiat',
                             'BMW', 'Ford', 'Fiat', 'BMW', 'Ford', 'Fiat']})
I want to group the position and brand together to make a category.
The output would look like this:
ID Group
a 1
b 2
c 3
d 3
Because group 1 is:
0 Mazda
1 BMW
2 Ford
3 Fiat
4 Dodge
And c = d because they both have the same car makers in the same order, so the group is the same - 3:
0 BMW
1 Ford
2 Fiat
If d would have different order defined by the column position it would be a different category:
0 Fiat
1 BMW
2 Ford
How could I achieve the output as defined in the second code block?
Thank you for your suggestions.
You can distinguish groups by their first 3 rows: filter with head, convert the values to tuples, and then use pandas.factorize:
s = (df.groupby('ID', sort=False)[['Position', 'Brand']]
       .apply(lambda x: tuple(x.head(3).values.ravel())))
df = pd.DataFrame({'ID': s.index, 'Cat': pd.factorize(s)[0] + 1})
print(df)
  ID  Cat
0  a    1
1  b    1
2  c    2
3  d    2
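Note that comparing only the first three rows lumps a and b into the same category. If the category should reflect the full Brand sequence ordered by Position (as the question's expected output implies), one possible sketch:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'],
                   'Position': ['0', '1', '2', '3', '4', '0', '1', '2', '3', '0', '1', '2', '0', '1', '2'],
                   'Brand': ['Mazda', 'BMW', 'Ford', 'Fiat', 'Dodge', 'Mazda', 'BMW', 'Ford', 'Fiat',
                             'BMW', 'Ford', 'Fiat', 'BMW', 'Ford', 'Fiat']})

# Compare the *whole* Brand sequence per ID (ordered by Position),
# so groups of different length land in different categories
s = (df.sort_values(['ID', 'Position'])
       .groupby('ID', sort=False)['Brand']
       .apply(tuple))
out = pd.DataFrame({'ID': s.index, 'Group': pd.factorize(s)[0] + 1})
print(out)
```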

Flatten json to get multiple columns in Pandas

I have a sample dataframe as
sample_df = pd.DataFrame({'id': [1, 2], 'fruits': [
    [{'name': u'mango', 'cost': 100, 'color': u'yellow', 'size': 12}],
    [{'name': u'mango', 'cost': 150, 'color': u'yellow', 'size': 21},
     {'name': u'banana', 'cost': 200, 'color': u'green', 'size': 10}]
]})
I would like to flatten the fruits column to get new columns like name, cost, color and size. One id can have more than one fruit entry. For example, id 2 has information for 2 fruits: mango and banana.
print(sample_df)
                                              fruits  id
0  [{'name': 'mango', 'cost': 100, 'color': 'yell...   1
1  [{'name': 'mango', 'cost': 150, 'color': 'yell...   2
In the output I would like to have 3 records, 1 record with fruit information for id 1 and 2 records for fruit information for id 2
Is there a way to parse this structure using pandas ?
First unnest the fruits column, then convert the dicts to a DataFrame and concat the values back:
s = unnesting(sample_df, ['fruits']).reset_index(drop=True)
df = pd.concat([s.drop('fruits', axis=1), pd.DataFrame(s.fruits.tolist())], axis=1)
df
Out[149]:
   id   color  cost    name  size
0   1  yellow   100   mango    12
1   2  yellow   150   mango    21
2   2   green   200  banana    10
import numpy as np

def unnesting(df, explode):
    # Repeat each index once per element of the list to explode
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
Method 2
sample_df.set_index('id').fruits.apply(pd.Series).stack().apply(pd.Series).reset_index(level=0)
Out[159]:
   id   color  cost    name  size
0   1  yellow   100   mango    12
0   2  yellow   150   mango    21
1   2   green   200  banana    10
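On pandas 0.25+ the custom unnesting helper is no longer needed: `DataFrame.explode` plus `json_normalize` do the same job. A sketch:

```python
import pandas as pd

sample_df = pd.DataFrame({'id': [1, 2], 'fruits': [
    [{'name': 'mango', 'cost': 100, 'color': 'yellow', 'size': 12}],
    [{'name': 'mango', 'cost': 150, 'color': 'yellow', 'size': 21},
     {'name': 'banana', 'cost': 200, 'color': 'green', 'size': 10}]
]})

# One row per list element, then expand each dict into columns
exploded = sample_df.explode('fruits').reset_index(drop=True)
out = pd.concat([exploded.drop(columns='fruits'),
                 pd.json_normalize(exploded['fruits'].tolist())], axis=1)
print(out)
```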

How to paste a list into a multi index dataframe?

Could you let me know how to paste a list into a multi-index DataFrame?
I want to paste list1 into the column ([func1 - In - Name1, Name2]['Val6'])
and list2 into the column ([func1 - Out - Name3, Name4]['Val6']) of the multi-index DataFrame.
below is dataframe I used
import numpy as np
from pandas import Series, DataFrame

raw_data = {'Function': ['env', 'env', 'env', 'func1', 'func1', 'func1'],
            'Type': ['In', 'In', 'In', 'In', 'In', 'out'],
            'Name': ['Volt', 'Temp', 'BD#', 'Name1', 'Name2', 'Name3'],
            'Val1': ['Max', 'High', '1', '3', '5', '6'],
            'Val2': ['Typ', 'Mid', '2', '4', '7', '6'],
            'Val3': ['Min', 'Low', '3', '3', '6', '3'],
            'Val4': ['Max', 'High', '4', '3', '9', '4'],
            'Val5': ['Max', 'Low', '5', '3', '4', '5']}
df = DataFrame(raw_data)
df = df.set_index(["Function", "Type", "Name"])
df['Val6'] = np.nan
list1 = [1, 2]
list2 = [3, 4]
print(df)
below is printed dataframe
                     Val1  Val2  Val3  Val4  Val5  Val6
Function Type Name
env      In   Volt    Max   Typ   Min   Max   Max   NaN
              Temp   High   Mid   Low  High   Low   NaN
              BD#       1     2     3     4     5   NaN
func1    In   Name1     4     2     3     4     5   NaN
              Name2     6     7     6     9     4   NaN
         out  Name3     6     6     3     4     5   NaN
              Name4     3     3     4     5     6   NaN
Below is expected results.
I'd like to sequentially put each list1 and list2 into dataframe instead of NaN like below
                     Val1  Val2  Val3  Val4  Val5  Val6
Function Type Name
env      In   Volt    Max   Typ   Min   Max   Max   NaN
              Temp   High   Mid   Low  High   Low   NaN
              BD#       1     2     3     4     5   NaN
func1    In   Name1     4     2     3     4     5     1
              Name2     6     7     6     9     4     2
         out  Name3     6     6     3     4     5     3
              Name4     3     3     4     5     6     4
I have tried to use the concat and replace functions to do it, but failed.
In a more complex dataframe, I think it is better to use a mask on the multi-index of the dataframe.
list1 = [1, 2]
list2 = [3, 4]
m1 = df.index.get_level_values(0) == 'func1'
m2 = df.index.get_level_values(1) == 'In'
list1 = [float(i) for i in list1]
df_list1 = pd.DataFrame(list1)
df.replace(df[m1 & m2]['Val6'], df_list1)
Unfortunately, I don't have any idea to solve the problem. T_T
Please give me some advice.
IIUC, add an extra line at the end and simply modify it like a non-multi-index dataframe:
df['Val6'] = df['Val6'].tolist()[:-4] + list1 + list2
So your code would be:
import numpy as np
from pandas import Series, DataFrame

raw_data = {'Function': ['env', 'env', 'env', 'func1', 'func1', 'func1'],
            'Type': ['In', 'In', 'In', 'In', 'In', 'out'],
            'Name': ['Volt', 'Temp', 'BD#', 'Name1', 'Name2', 'Name3'],
            'Val1': ['Max', 'High', '1', '3', '5', '6'],
            'Val2': ['Typ', 'Mid', '2', '4', '7', '6'],
            'Val3': ['Min', 'Low', '3', '3', '6', '3'],
            'Val4': ['Max', 'High', '4', '3', '9', '4'],
            'Val5': ['Max', 'Low', '5', '3', '4', '5']}
df = DataFrame(raw_data)
df = df.set_index(["Function", "Type", "Name"])
df['Val6'] = np.nan
list1 = [1,2]
list2 = [3,4]
df['Val6'] = df['Val6'].tolist()[:-4] + list1 + list2
print(df)
Output:
                     Val1  Val2  Val3  Val4  Val5  Val6
Function Type Name
env      In   Volt    Max   Typ   Min   Max   Max   NaN
              Temp   High   Mid   Low  High   Low   NaN
              BD#       1     2     3     4     5   1.0
func1    In   Name1     3     4     3     3     3   2.0
              Name2     5     7     6     9     4   3.0
         out  Name3     6     6     3     4     5   4.0
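A positional slice is fragile if row order changes. An alternative sketch writes through index masks with `.loc`, so the assignment targets exactly the func1 rows (note the sample frame has only one 'out' row, so only the first element of list2 fits here):

```python
import pandas as pd
import numpy as np

raw_data = {'Function': ['env', 'env', 'env', 'func1', 'func1', 'func1'],
            'Type': ['In', 'In', 'In', 'In', 'In', 'out'],
            'Name': ['Volt', 'Temp', 'BD#', 'Name1', 'Name2', 'Name3'],
            'Val1': ['Max', 'High', '1', '3', '5', '6'],
            'Val2': ['Typ', 'Mid', '2', '4', '7', '6'],
            'Val3': ['Min', 'Low', '3', '3', '6', '3'],
            'Val4': ['Max', 'High', '4', '3', '9', '4'],
            'Val5': ['Max', 'Low', '5', '3', '4', '5']}
df = pd.DataFrame(raw_data).set_index(['Function', 'Type', 'Name'])
df['Val6'] = np.nan

# Boolean masks over the index levels pick out the target rows,
# independent of where they sit in the frame
m_func = df.index.get_level_values('Function') == 'func1'
m_in = df.index.get_level_values('Type') == 'In'
m_out = df.index.get_level_values('Type') == 'out'
df.loc[m_func & m_in, 'Val6'] = [1, 2]
df.loc[m_func & m_out, 'Val6'] = [3]
print(df)
```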

Pandas: Reconstruct dataframe from strings of key:value pairs

Suppose I have following dataset:
                   0
0  foo:1 bar:2 baz:3
1        bar:4 baz:5
2              foo:6
So each line is essentially a dict serialized into string, where key:value pairs are separated by space. There are hundreds of key:value pairs in each row, while number of unique keys is some few thousands. So data is sparse, so to speak.
What I want to get is a nice DataFrame where keys are columns and values are cells. And missing values are replaced by zeros. Like this:
   foo  bar  baz
0    1    2    3
1    0    4    5
2    6    0    0
I know I can split string into key:value pairs:
In: frame[0].str.split(' ')
Out:
                       0
0  [foo:1, bar:2, baz:3]
1         [bar:4, baz:5]
2                [foo:6]
But what's next?
Edit: I'm running within AzureML Studio environment. So efficiency is important.
You can try a list comprehension, then create a new DataFrame with from_records and fillna with 0:
s = df['0'].str.split(' ')
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
#[{'baz': '3', 'foo': '1', 'bar': '2'}, {'baz': '5', 'bar': '4'}, {'foo': '6'}]
print(pd.DataFrame.from_records(d).fillna(0))
#  bar baz foo
#0   2   3   1
#1   4   5   0
#2   0   0   6
EDIT:
You can get better performance if you pass the index and columns parameters to from_records:
print(df)
                               0
0              foo:1 bar:2 baz:3
1                    bar:4 baz:5
2                          foo:6
3  foo:1 bar:2 baz:3 bal:8 adi:5
3 foo:1 bar:2 baz:3 bal:8 adi:5
s = df['0'].str.split(' ')
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
[{'baz': '3', 'foo': '1', 'bar': '2'},
 {'baz': '5', 'bar': '4'},
 {'foo': '6'},
 {'baz': '3', 'bal': '8', 'foo': '1', 'bar': '2', 'adi': '5'}]
If the longest dictionary has all the keys, it can supply all possible columns:
cols = list(sorted(d, key=len, reverse=True)[0].keys())
print(cols)
['baz', 'bal', 'foo', 'bar', 'adi']
df = pd.DataFrame.from_records(d, index=df.index, columns=cols)
df = df.fillna(0)
print(df)
  baz bal foo bar adi
0   3   0   1   2   0
1   5   0   0   4   0
2   0   0   6   0   0
3   3   8   1   2   5
EDIT2: If the longest dictionary doesn't contain all keys and some keys appear only in other dictionaries, use:
list(set( val for dic in d for val in dic.keys()))
Sample:
print(df)
                               0
0            foo1:1 bar:2 baz1:3
1                    bar:4 baz:5
2                          foo:6
3  foo:1 bar:2 baz:3 bal:8 adi:5
s = df['0'].str.split(' ')
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
[{'baz1': '3', 'bar': '2', 'foo1': '1'},
 {'baz': '5', 'bar': '4'},
 {'foo': '6'},
 {'baz': '3', 'bal': '8', 'foo': '1', 'bar': '2', 'adi': '5'}]
cols = list(set(val for dic in d for val in dic.keys()))
print(cols)
['bar', 'baz', 'baz1', 'bal', 'foo', 'foo1', 'adi']
df = pd.DataFrame.from_records(d, index=df.index, columns=cols)
df = df.fillna(0)
print(df)
  bar baz baz1 bal foo foo1 adi
0   2   0    3   0   0    1   0
1   4   5    0   0   0    0   0
2   0   0    0   0   6    0   0
3   2   3    0   8   1    0   5
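Under Python 3 and current pandas the same idea can be written compactly, since the DataFrame constructor aligns the union of keys into columns by itself. A sketch (values stay strings unless cast):

```python
import pandas as pd

df = pd.DataFrame({'0': ['foo:1 bar:2 baz:3', 'bar:4 baz:5', 'foo:6']})

# Parse each row into a dict of key -> value, then let the
# constructor align the keys; missing keys become NaN -> 0
parsed = [dict(pair.split(':', 1) for pair in row.split())
          for row in df['0']]
out = pd.DataFrame(parsed, index=df.index).fillna(0)
print(out)
```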
