Suppose I have the following dataset:
0
0 foo:1 bar:2 baz:3
1 bar:4 baz:5
2 foo:6
So each line is essentially a dict serialized into a string, where key:value pairs are separated by spaces. There are hundreds of key:value pairs in each row, while the number of unique keys is a few thousand. So the data is sparse, so to speak.
What I want is a DataFrame where keys are columns and values are cells, with missing values replaced by zeros. Like this:
foo bar baz
0 1 2 3
1 0 4 5
2 6 0 0
I know I can split the string into key:value pairs:
In: frame[0].str.split(' ')
Out:
0
0 [foo:1, bar:2, baz:3]
1 [bar:4, baz:5]
2 [foo:6]
But what's next?
Edit: I'm running within the Azure ML Studio environment, so efficiency is important.
You can use a list comprehension, then create a new DataFrame with from_records and fillna with 0:
s = df['0'].str.split(' ')
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
#[{'baz': '3', 'foo': '1', 'bar': '2'}, {'baz': '5', 'bar': '4'}, {'foo': '6'}]
print(pd.DataFrame.from_records(d).fillna(0))
# bar baz foo
#0 2 3 1
#1 4 5 0
#2 0 0 6
EDIT:
You can get better performance if you pass the index and columns parameters to from_records:
print(df)
0
0 foo:1 bar:2 baz:3
1 bar:4 baz:5
2 foo:6
3 foo:1 bar:2 baz:3 bal:8 adi:5
s = df['0'].str.split(' ')
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
[{'baz': '3', 'foo': '1', 'bar': '2'},
{'baz': '5', 'bar': '4'},
{'foo': '6'},
{'baz': '3', 'bal': '8', 'foo': '1', 'bar': '2', 'adi': '5'}]
If the longest dictionary has all the keys, it can be used to create all possible columns:
cols = list(sorted(d, key=len, reverse=True)[0].keys())
print(cols)
['baz', 'bal', 'foo', 'bar', 'adi']
df = pd.DataFrame.from_records(d, index=df.index, columns=cols)
df = df.fillna(0)
print(df)
baz bal foo bar adi
0 3 0 1 2 0
1 5 0 0 4 0
2 0 0 6 0 0
3 3 8 1 2 5
EDIT2: If the longest dictionary doesn't contain all the keys and some keys appear only in other dictionaries, use:
list(set(val for dic in d for val in dic.keys()))
Sample:
print(df)
0
0 foo1:1 bar:2 baz1:3
1 bar:4 baz:5
2 foo:6
3 foo:1 bar:2 baz:3 bal:8 adi:5
s = df['0'].str.split(' ')
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
[{'baz1': '3', 'bar': '2', 'foo1': '1'},
{'baz': '5', 'bar': '4'},
{'foo': '6'},
{'baz': '3', 'bal': '8', 'foo': '1', 'bar': '2', 'adi': '5'}]
cols = list(set(val for dic in d for val in dic.keys()))
print(cols)
['bar', 'baz', 'baz1', 'bal', 'foo', 'foo1', 'adi']
df = pd.DataFrame.from_records(d, index=df.index, columns=cols)
df = df.fillna(0)
print(df)
bar baz baz1 bal foo foo1 adi
0 2 0 3 0 0 1 0
1 4 5 0 0 0 0 0
2 0 0 0 0 6 0 0
3 2 3 0 8 1 0 5
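Note that the parsed values are still strings at this point, since they come from splitting text, so the 0-filled frame mixes str and int. A minimal follow-up sketch if you need numeric columns, assuming every value is an integer:
#cast the object columns to int; this fails loudly if any value is not numeric
df = df.astype(int)
print(df.dtypes)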
The desired result is this:
id name
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
Currently I do it this way:
import pandas as pd
df = pd.DataFrame({'home_id': ['1', '3', '5', '7'],
                   'home_name': ['A', 'C', 'E', 'G'],
                   'away_id': ['2', '4', '6', '8'],
                   'away_name': ['B', 'D', 'F', 'H']})
id_col = pd.concat([df['home_id'], df['away_id']])
name_col = pd.concat([df['home_name'], df['away_name']])
result = pd.DataFrame({'id': id_col, 'name': name_col})
result = result.sort_index().reset_index(drop=True)
print(result)
But this approach relies on the index to interleave the columns, which can produce errors when there are duplicate indexes.
How can I interleave the column values so that the order is always:
home of the 1st row, then away of the 1st row, then home of the 2nd row, then away of the 2nd row, and so on...
Try this; reshaping the underlying 4-column array into pairs interleaves the home and away rows:
out = pd.DataFrame(df.values.reshape(-1, 2), columns=['ID', 'Name'])
print(out)
>>>
ID Name
0 1 A
1 2 B
2 3 C
3 4 D
4 5 E
5 6 F
6 7 G
7 8 H
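The reshape relies on the DataFrame's columns being ordered home_id, home_name, away_id, away_name. If that order isn't guaranteed, a safer variant is to select the columns explicitly (a sketch of the same idea):
#fix the column order before reshaping so the pairing cannot silently break
cols = ['home_id', 'home_name', 'away_id', 'away_name']
out = pd.DataFrame(df[cols].values.reshape(-1, 2), columns=['ID', 'Name'])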
Similar to Python's zip, you can iterate through both DataFrames:
home = pd.DataFrame(df[['home_id', 'home_name']].values, columns=('id', 'name'))
away = pd.DataFrame(df[['away_id', 'away_name']].values, columns=('id', 'name'))

def zip_dataframes(df1, df2):
    rows = []
    for i in range(len(df1)):
        rows.append(df1.iloc[i, :])
        rows.append(df2.iloc[i, :])
    return pd.concat(rows, axis=1).T

zip_dataframes(home, away)
id name
0 1 A
0 2 B
1 3 C
1 4 D
2 5 E
2 6 F
3 7 G
3 8 H
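The duplicated index labels (0, 0, 1, 1, ...) are inherited from the two source frames; if you want a clean 0..n-1 index, reset it afterwards:
zip_dataframes(home, away).reset_index(drop=True)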
You can do this using pd.wide_to_long with a little column header renaming:
import pandas as pd
df = pd.DataFrame({'home_id': ['1', '3', '5', '7'],
                   'home_name': ['A', 'C', 'E', 'G'],
                   'away_id': ['2', '4', '6', '8'],
                   'away_name': ['B', 'D', 'F', 'H']})
dfr = df.rename(columns=lambda x: '_'.join(x.split('_')[::-1])).reset_index()
df_out = (pd.wide_to_long(dfr, ['id', 'name'], 'index', 'No', sep='_', suffix='.*')
            .reset_index(drop=True)
            .sort_values('id'))
df_out
Output:
id name
0 1 A
4 2 B
1 3 C
5 4 D
2 5 E
6 6 F
3 7 G
7 8 H
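One caveat: the ids here are strings, so sort_values('id') sorts lexicographically. That matches numeric order for single-digit ids, but with multi-digit ids you would want to cast first (a sketch, assuming all ids are integers):
#cast id to int so '10' sorts after '9' rather than right after '1'
df_out = df_out.astype({'id': int}).sort_values('id').reset_index(drop=True)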
I have this df:
import pandas as pd
df = pd.DataFrame({'Time' : ['s_1234', 's_1234', 's_1234', 's_5678', 's_8998', 's_8998'],
                   'Control' : ['A', '', '', 'B', 'C', ''],
                   'tot_1' : ['1', '1', '1', '1', '1', '1'],
                   'tot_2' : ['2', '2', '2', '2', '2', '2']})
--------
Time Control tot_1 tot_2
0 1234 A 1 2
1 1234 A 1 2
2 1234 1 2
3 5678 B 1 2
4 8998 C 1 2
5 8998 1 2
I would like rows with an equal Time value to be merged into one row. I would also like the "tot_1" and "tot_2" columns to be summed. And finally I would like to keep the Control value if one is present. Like:
Time Control tot_1 tot_2
0 1234 A 3 6
1 5678 B 1 2
2 8998 C 2 4
Your data is different from the example df.
Construct the df:
import pandas as pd
df = pd.DataFrame({'Time' : ['s_1234', 's_1234', 's_1234', 's_5678', 's_8998', 's_8998'],
                   'Control' : ['A', '', '', 'B', 'C', ''],
                   'tot_1' : ['1', '1', '1', '1', '1', '1'],
                   'tot_2' : ['2', '2', '2', '2', '2', '2']})
df.Time = df.Time.str.split("_").str[1]
df = df.astype({"tot_1": int, "tot_2": int})
Group by Time and aggregate the values.
df.groupby('Time').agg({"Control": "first", "tot_1": "sum", "tot_2": "sum"}).reset_index()
Time Control tot_1 tot_2
0 1234 A 3 6
1 5678 B 1 2
2 8998 C 2 4
EDIT for comment: Not sure if that's the best way to do it, but you could construct your agg information like this:
n = 2
agg_ = {"Control": "first"} | {f"tot_{i+1}": "sum" for i in range(n)}
df.groupby('Time').agg(agg_).reset_index()
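If the number of tot_ columns isn't fixed, you could instead derive the aggregation dict from the column names (a sketch, assuming every such column starts with "tot_"; note the dict | operator needs Python 3.9+):
agg_ = {"Control": "first"} | {c: "sum" for c in df.columns if c.startswith("tot_")}
df.groupby('Time').agg(agg_).reset_index()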
I want to create a df starting from this data:
item_features = {'A': {1, 2, 3}, 'B': {7, 2, 1}, 'C': {3, 2}, 'D': {9, 11}}
pos = {'B', 'C'}
neg = {'A'}
I want to obtain the following dataset:
1 2 3 7 positive item_id
0 1 1 0 1 1 B
1 0 1 1 0 1 C
2 1 1 1 0 0 A
So I want the df to:
- always have its columns ordered by their number during the creation process. Like in this case it is 1-2-3-4, and I want to be sure that I never get an order like 4-1-3-2.
- contain only item_ids that are in one of the 2 sets (pos or neg).
- have the 'positive' column set to 1 if the item is positive, else 0.
- have the other column names be the values in the item_features dictionary, but only for the items that are either in pos or in neg.
- have the value in a column be 1 if the corresponding column name is in the item_features value for that specific item.
What is an efficient way to do that?
Use:
item_features = {'A': {1, 2, 3}, 'B': {4, 2, 1}, 'C': {3, 2}, 'D': {9, 11}}
pos = {'B', 'C'}
neg = {'A'}
#join sets
both = pos.union(neg)
#create Series, filter by both and create indicator columns
df = pd.Series(item_features).loc[both].agg(lambda x: '|'.join(map(str, x))).str.get_dummies()
df['item_id'] = df.index
df['positive'] = df['item_id'].isin(pos).astype(int)
df = df.reset_index(drop=True)
print(df)
1 2 3 4 item_id positive
0 0 1 1 0 C 1
1 1 1 0 1 B 1
2 1 1 1 0 A 0
If possible, use lists instead of sets:
item_features = {'A': {1, 2, 3}, 'B': {4, 2, 1}, 'C': {3, 2}, 'D': {9, 11}}
pos = ['B', 'C']
neg = ['A']
both = pos + neg
#create Series, filter by both and create indicator columns
df = pd.Series(item_features).loc[both].agg(lambda x: '|'.join(map(str, x))).str.get_dummies()
df = df.sort_index(axis=1, level=0, key=lambda x: x.astype(int))
df['item_id'] = df.index
df['positive'] = df['item_id'].isin(pos).astype(int)
df = df.reset_index(drop=True)
print(df)
1 2 3 4 item_id positive
0 1 1 0 1 B 1
1 0 1 1 0 C 1
2 1 1 1 0 A 0
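Note: the key argument of sort_index was added in pandas 1.1. On older versions you can reorder the indicator columns manually right after get_dummies, while they are still the only columns (a sketch, assuming every column label is a numeric string at that point):
#order the dummy columns numerically instead of lexicographically
df = df[sorted(df.columns, key=int)]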
EDIT: a solution for improved performance is:
item_features = {'A': {1, 2, 3}, 'B': {4, 2, 11}, 'C': {3, 2}, 'D': {9, 11}}
pos = ['B', 'C']
neg = ['A']
both = pos + neg
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
d = {k: item_features[k] for k in both}
df = pd.DataFrame(mlb.fit_transform(list(d.values())), columns=mlb.classes_)
df['item_id'] = list(d.keys())
df['positive'] = df['item_id'].isin(pos).astype(int)
print(df)
1 2 3 4 11 item_id positive
0 0 1 0 1 1 B 1
1 0 1 1 0 0 C 1
2 1 1 1 0 0 A 0
My DataFrame has some columns where each value can be "1", "2", "3" or "any". Here is an example:
>>> df = pd.DataFrame({'a': ['1', '2', 'any', '3'], 'b': ['any', 'any', '3', '1']})
>>> df
a b
0 1 any
1 2 any
2 any 3
3 3 1
In my case, "any" means that the value can be "1", "2" or "3". I would like to generate all possible rows using only values "1", "2" and "3" (or, in general, any list of values that I might have). Here is the expected output for the example above:
a b
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 3 3
7 3 1
I got this output with this kind of ugly and complicated approach:
a = df['a'].replace('any', '1,2,3').apply(lambda x: eval(f'[{str(x)}]')).explode()
result = pd.merge(df.drop(columns=['a']), a, left_index=True, right_index=True)
b = result['b'].replace('any', '1,2,3').apply(lambda x: eval(f'[{str(x)}]')).explode()
result = pd.merge(result.drop(columns=['b']), b, left_index=True, right_index=True)
result = result.drop_duplicates().reset_index(drop=True)
Is there any simpler and/or nicer approach?
You can replace the string any with, e.g. '1,2,3', then split and explode (the output below was produced with an extra column c in the frame, to show that non-target columns pass through unchanged):
(df.replace('any', '1,2,3')
   .apply(lambda x: x.str.split(',') if x.name in ['a','b'] else x)
   .explode('a').explode('b')
   .drop_duplicates(['a','b'])
)
Output:
a b c
0 1 1 1
0 1 2 1
0 1 3 1
1 2 1 1
1 2 2 1
1 2 3 1
2 3 3 1
3 3 1 1
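The row labels above are the pre-explode indices. To renumber them as in the expected output, chain a reset onto the same expression:
(df.replace('any', '1,2,3')
   .apply(lambda x: x.str.split(',') if x.name in ['a','b'] else x)
   .explode('a').explode('b')
   .drop_duplicates(['a','b'])
   .reset_index(drop=True)
)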
I would not use eval and string manipulations, but just replace 'any' with a set of values:
import pandas as pd
df = pd.DataFrame({'a': ['1', '2', 'any', '3'], 'b': ['any', 'any', '3', '1']})
df['c'] = '1'
df[df == 'any'] = {'1', '2', '3'}
for col in df:
    df = df.explode(col)
df = df.drop_duplicates().reset_index(drop=True)
print(df)
This gives the result:
a b c
0 1 2 1
1 1 3 1
2 1 1 1
3 2 2 1
4 2 3 1
5 2 1 1
6 3 3 1
7 3 1 1
I have a DataFrame that is in too "compact" a form. The DataFrame is currently like this:
> import numpy as np
> import pandas as pd
> df = pd.DataFrame({'foo': ['A','B'],
                     'bar': ['1', '2'],
                     'baz': [np.nan, '3']})
bar baz foo
0 1 NaN A
1 2 3 B
And I need to "unstack" it to be like so:
> df = pd.DataFrame({'foo': ['A','B', 'B'],
                     'type': ['bar', 'bar', 'baz'],
                     'value': ['1', '2', '3']})
foo type value
0 A bar 1
1 B bar 2
2 B baz 3
No matter how I try to pivot, I can't get it right.
Use the melt() method:
In [39]: pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type')
Out[39]:
foo type value
0 A bar 1
1 B bar 2
2 A baz NaN
3 B baz 3
or
In [38]: pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type').dropna()
Out[38]:
foo type value
0 A bar 1
1 B bar 2
3 B baz 3
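dropna keeps the original row labels (note the 0, 1, 3 index above); if you want them renumbered as in the desired output, add a reset:
pd.melt(df, id_vars='foo', value_vars=['bar','baz'], var_name='type').dropna().reset_index(drop=True)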
Set your index to foo, then stack:
df.set_index('foo').stack()
foo
A bar 1
B bar 2
baz 3
dtype: object
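The stacked result is a Series with a MultiIndex. To get back to the flat three-column frame from the question, a small follow-up sketch:
#flatten the MultiIndex back into columns and name them
out = df.set_index('foo').stack().reset_index()
out.columns = ['foo', 'type', 'value']
print(out)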