Pandas DataFrame from list/dict/list - python

I have some data in this form:
a = [{'table': 'a', 'field': ['apple', 'pear']},
     {'table': 'b', 'field': ['grape', 'berry']}]
I want to create a dataframe that looks like this:
   field table
0  apple     a
1   pear     a
2  grape     b
3  berry     b
When I try this:
pd.DataFrame.from_records(a)
I get this:
            field table
0   [apple, pear]     a
1  [grape, berry]     b
I'm using a loop to restructure my original data, but I think there must be a simpler, more straightforward method.

You can use a list comprehension to concatenate a series of dataframes, one for each dictionary in a.
>>> pd.concat([pd.DataFrame({'table': d['table'],  # per @piRSquared's simplification
...                          'field': d['field']})
...            for d in a]).reset_index(drop=True)
   field table
0  apple     a
1   pear     a
2  grape     b
3  berry     b

Option 1
comprehension
pd.DataFrame([{'table': d['table'], 'field': f} for d in a for f in d['field']])
   field table
0  apple     a
1   pear     a
2  grape     b
3  berry     b
Option 2
reconstruct
import numpy as np

d1 = pd.DataFrame(a)
pd.DataFrame(dict(
    table=d1.table.repeat(d1.field.str.len()),
    field=np.concatenate(d1.field)
)).reset_index(drop=True)
   field table
0  apple     a
1   pear     a
2  grape     b
3  berry     b
Option 3
Rubik's Cube
pd.DataFrame(a).set_index('table').field.apply(pd.Series) \
    .stack().reset_index('table', name='field').reset_index(drop=True)
  table  field
0     a  apple
1     a   pear
2     b  grape
3     b  berry
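On pandas 0.25 or later there is also DataFrame.explode, which turns each element of a list-valued column into its own row; a minimal sketch of the same result, assuming a recent pandas version:
pd.DataFrame(a).explode('field').reset_index(drop=True)  # one row per element of each 'field' list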

Or you can try pd.wide_to_long. I wanted to use lreshape, but it is undocumented, so I personally don't recommend it.
a = [{'table': 'a', 'field': ['apple', 'pear']},
     {'table': 'b', 'field': ['grape', 'berry']}]
df = pd.DataFrame.from_records(a)
df[['Field1', 'Field2']] = df.pop('field').apply(pd.Series)
pd.wide_to_long(df, ['Field'], 'table', 'lol').reset_index().drop('lol', axis=1).sort_values('table')
Out[74]:
  table  Field
0     a  apple
2     a   pear
1     b  grape
3     b  berry

Related

How to expand dictionaries in rows of pandas dataframe with unique column names?

I have a dataframe with rows as dictionaries as below:
Col1               A       B
{'A': 1, 'B': 23}  apple   carrot
{'A': 3, 'B': 35}  banana  spinach
I want to expand Col1 such that the dataframe looks like this:
Col1.A  Col1.B  A       B
1       23      apple   carrot
3       35      banana  spinach
How can I do this using pandas in python? Please let me know if there is any other way as well.
I tried using pd.explode, but the new column names are duplicated. How can I avoid this?
df["Col1.A"] = df["Col1"].map(lambda x: x["A"])
df["Col1.B"] = df["Col1"].map(lambda x: x["B"])
df.drop("Col1", axis=1, inplace=True)
As a generic method that doesn't require knowledge of the dictionary keys:
df = (pd.json_normalize(df.pop('Col1'))
        .add_prefix('Col1.')
        .join(df))
Or, if you don't want to alter df:
out = (pd.json_normalize(df['Col1'])
         .add_prefix('Col1.')
         .join(df.drop(columns='Col1')))
Output:
   Col1.A  Col1.B       A        B
0       1      23   apple   carrot
1       3      35  banana  spinach
To convert them to columns, you can use:
Col1 = df['Col1'].apply(pd.Series)
Result:
   A   B
0  1  23
1  3  35
Then, if you want, you can add this to your dataframe like this:
Col1.join(df.drop(columns='Col1'), lsuffix='_Col1')
Output:
   A_Col1  B_Col1       A        B
0       1      23   apple   carrot
1       3      35  banana  spinach
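When every entry of Col1 is a flat dict (no nesting), another common pattern is to build the new columns straight from the column's values; a minimal sketch of the same idea (note that df.pop mutates df, just like the json_normalize version above):
out = df.join(pd.DataFrame(df.pop('Col1').tolist(), index=df.index).add_prefix('Col1.'))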

Find the frequency of words in a dataframe from a list

import pandas as pd
fruits = ['apple', 'banana', 'cherries', 'dragonfruit', 'elderberry']
data = {'name': ['Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo'],
        'favorite_fruit': ['apple banana cherries', 'banana cherries dragonfruit',
                           'cherries dragonfruit', 'dragonfruit', 'apple elderberry']}
df = pd.DataFrame(data, columns=['name', 'favorite_fruit'])
I want to count the frequency of every fruit in the list within the df.
Expected Output:
df2
Fruit       | Frequency
Apple       | 2
Banana      | 2
Cherries    | 3
Dragonfruit | 3
Elderberry  | 1
The code df.favorite_fruit.str.split(expand=True).stack().value_counts() works for a small DataFrame.
If df.favorite_fruit contains thousands of rows of different fruit combinations,
how do I find only the frequency of words in the list?
Maybe this is a loophole answer, but you can just filter the values out of the result you already described. So if you start with this:
>>> df2 = df.favorite_fruit.str.split(expand=True).stack()
>>> df2
0  0          apple
   1         banana
   2       cherries
1  0         banana
   1       cherries
   2    dragonfruit
2  0       cherries
   1    dragonfruit
3  0    dragonfruit
4  0          apple
   1     elderberry
dtype: object
You could use isin to limit the data to ones in the target list:
>>> target = ['apple', 'banana']
>>> df2[df2.isin(target)].value_counts()
banana    2
apple     2
dtype: int64
Or even after your original answer:
>>> df.favorite_fruit.str.split(expand=True).stack().value_counts().loc[target]
apple     2
banana    2
dtype: int64
If the issue is that the expand and stack operations are costly at that scale, then maybe this won't be satisfactory. But I think it's possible this is still better than loop-based answers.
Perhaps a bit of a roundabout way of doing it, but if your favorite_fruit column is always space-delimited, something like this should work:
import pandas as pd

fruits = ['apple', 'banana', 'cherries', 'dragonfruit', 'elderberry']
data = {'name': ['Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo'],
        'favorite_fruit': ['apple banana cherries', 'banana cherries dragonfruit',
                           'cherries dragonfruit', 'dragonfruit', 'apple elderberry']}
df = pd.DataFrame(data, columns=['name', 'favorite_fruit'])

# tally one entry per occurrence of each word
data = {}
for i, row in df.iterrows():
    items = row['favorite_fruit'].split(' ')
    for item in items:
        if item in data:
            data[item].append(1)
        else:
            data[item] = [1]

# collapse each list of 1s into a count
for key, value in data.items():
    data[key] = sum(value)

fruit = []
frequency = []
for key, value in data.items():
    fruit.append(key)
    frequency.append(value)

new_df = pd.DataFrame({'fruit': fruit, 'frequency': frequency})
print(new_df)
This prints out the following:
         fruit  frequency
0        apple          2
1       banana          2
2     cherries          3
3  dragonfruit          3
4   elderberry          1
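The same tally can be done more directly with collections.Counter from the standard library, which avoids building lists of 1s; a sketch of the same idea, reusing df from above:
from collections import Counter

# count every word across all favorite_fruit strings
counts = Counter(word for s in df['favorite_fruit'] for word in s.split())
new_df = pd.DataFrame(list(counts.items()), columns=['fruit', 'frequency'])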
Try using the explode function after splitting.
df.favorite_fruit.str.split().explode().value_counts()
cherries       3
dragonfruit    3
banana         2
apple          2
elderberry     1
Name: favorite_fruit, dtype: int64
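To count only the words from your list, the same chain composes with isin; a small sketch, assuming the fruits list defined in the question:
words = df.favorite_fruit.str.split().explode()
words[words.isin(fruits)].value_counts()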

python: merge two datasets based on matching multiple columns in both the datasets and apply a script on the result

I have two datasets with multiple columns: dataset 1 (df1) has more than a couple of thousand rows, and dataset 2 (df2) is smaller, about 300 rows.
I need to pick up a 'value' from column 3 in df2 by matching 'fruit' in df1 with 'type' in df2, and 'expiry' in df1 with 'expiry' in df2.
Furthermore, instead of storing the 'value' directly in a new column in df1, I need to multiply it by the 'expiry' in each row and store the result in a new column in df1.
So, for example, if expiry is 2 the value gets multiplied by 2, and if it's 3 the value gets multiplied by 3, and so on.
I was able to solve this with the code below, but:
for i in range(0, len(df1)):
    df1_value = df2.loc[(df2['type'] == df1.iloc[i]['fruit']) &
                        (df2['expiry'] == str(df1.iloc[i]['expiry']))].iloc[0]['value']
    df1.loc[i, 'df_value'] = df1.iloc[i]['expiry'] * df1_value
It creates two issues:
If an iteration finds no match (for example, there is no 'value' for banana with expiry 3 in df2), the process stops with IndexError: single positional indexer is out-of-bounds.
Because df1 has a very large number of rows, the individual iterations take a lot of time.
Is there a better way to handle this?
say df1:
fruit expiry category
apple 3 a
apple 3 b
apple 4 c
apple 4 d
orange 2 a
orange 2 b
orange 3 c
orange 3 d
orange 3 e
banana 3 a
banana 3 b
banana 3 c
banana 4 d
pineapple 2 a
pineapple 3 b
pineapple 3 c
pineapple 4 d
pineapple 4 e
df2:
type expiry value
apple 2 100
apple 3 110
apple 4 120
orange 2 200
orange 3 210
orange 4 220
banana 2 310
banana 4 320
pineapple 2 410
pineapple 3 420
pineapple 4 430
Output (revised df1):
fruit expiry category df_value
apple 3 a 110*3=330
apple 3 b 110*3=330
apple 4 c 120*4=480
apple 4 d 120*4=480
orange 2 a 200...
orange 2 b 200...
orange 3 c 210...
orange 3 d 210...
orange 3 e 210...
banana 3 a 0
banana 3 b 0
banana 3 c 0
banana 4 d 320*4=1280
pineapple 2 a 410*2=820
pineapple 3 b 420...
pineapple 3 c 420...
pineapple 4 d 430....
pineapple 4 e 430....
As far as I know, you can only do this by using SQL within Python. SQL is used for relating databases that have at least one relatable column (if you've used Power BI or Tableau, you know what I mean) and for querying multiple dataframes through their mutual relationships. I do not know the language well, so I cannot help you further than this.
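That said, pandas can express this kind of keyed lookup directly with a left merge, which also vectorizes the multiplication; a minimal sketch, assuming the column names from the question and that both 'expiry' columns share a dtype (unmatched rows get 0 via fillna):
out = df1.merge(df2, how='left',
                left_on=['fruit', 'expiry'], right_on=['type', 'expiry'])
out['df_value'] = (out['value'] * out['expiry']).fillna(0)  # NaN where no match, hence fillna(0)
out = out.drop(columns=['type', 'value'])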

Python Pandas Data Frame Inserting Many Arbitrary Values

Let's say I have a data frame that looks like this:
A
0 Apple
1 orange
2 pear
3 apple
For index values 4-1000, I want all of them to say "watermelon".
Any suggestions?
Reindex and fill NaNs:
import numpy as np

df.reindex(np.r_[:1000]).fillna('watermelon')  # np.r_[:1000] builds the index [0, 1, ..., 999]
Or,
df = df.reindex(np.r_[:1000])
df.iloc[df['A'].last_valid_index() + 1:, 0] = 'watermelon'  # equivalently: df.iloc[4:, 0] = 'watermelon'
              A
0         Apple
1        orange
2          pear
3         apple
4    watermelon
5    watermelon
...
999  watermelon

Extract name of sum in Python Pandas

I have the following DataFrame called df:
KEY_ID READY STEADY GO
001 Yes Maybe 123
002 No Maybe 123
003 Yes Sometimes 234
004 Yes Later 234
005 No Sometimes 345
I use df.count() to see how many times a value is filled in, which is 5 for every column:
KEY_ID 5
READY 5
STEADY 5
GO 5
But I would also like to see how often each value in column STEADY is used. I do this with abc = df['STEADY'].value_counts(), which gives me:
Sometimes 2
Maybe 2
Later 1
With a for loop I can extract the counts from abc, which I just created with value_counts(), as follows:
for i in abc:
    print(i)
However, I tried several methods, including
for i, j in enumerate(abc):
    print(i)
    print(j)
to get the names Sometimes, Maybe and Later, since I don't want to type them manually. How do I extract the names behind the value_counts() values?
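Note that value_counts() returns a pandas Series, so the names are simply its index; a minimal sketch using the abc from the question:
abc = df['STEADY'].value_counts()
print(list(abc.index))             # e.g. ['Sometimes', 'Maybe', 'Later']
for name, count in abc.items():    # items() yields (label, count) pairs
    print(name, count)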
Are you looking for groupby()?
import pandas as pd
lst = [['Apple', 1], ['Orange', 1], ['Apple', 2], ['Orange', 1], ['Apple', 3], ['Orange', 1]]
df = pd.DataFrame(lst)
df.columns = ['fruit', 'amount']
df.groupby('fruit').sum()
import pandas as pd
rowdata = [['Apple', 1], ['Orange', 1], ['Apple', 2], ['Orange', 1], ['Apple', 3], ['Orange', 1]]
df = pd.DataFrame(rowdata)
df.groupby(0).sum()
This will give a data frame like the one below:
        1
0
Apple   6
Orange  3
But plain df.sum() will give this:
0    AppleOrangeAppleOrangeAppleOrange
1                                    9
I hope the first one is what you were expecting.
IIUC:
In [339]: df
Out[339]:
     name  val
0   Apple    1
1  Orange    1
2   Apple    2
3  Orange    1
4   Apple    3
5  Orange    1

In [340]: df.groupby('name', as_index=False)['val'].sum()
Out[340]:
     name  val
0   Apple    6
1  Orange    3

In [341]: df.groupby('name', as_index=False)['val'].sum()['name']
Out[341]:
0     Apple
1    Orange
Name: name, dtype: object

In [342]: df.groupby('name', as_index=False)['val'].sum()['name'].tolist()
Out[342]: ['Apple', 'Orange']
It seems you want to filter first by boolean indexing with isin:
print (df)
        A  B
0   Peach  3
1    Pear  6
2   Apple  1
3  Orange  1
4   Apple  2
5  Orange  1
6   Apple  3
7  Orange  1
df1 = df[df['A'].isin(['Apple', 'Orange'])]
print (df1)
        A  B
2   Apple  1
3  Orange  1
4   Apple  2
5  Orange  1
6   Apple  3
7  Orange  1
Then groupby and aggregate sum:
df2 = df1.groupby('A', as_index=False)['B'].sum()
print (df2)
        A  B
0   Apple  6
1  Orange  3
Another solution is to groupby and aggregate first, and then select only the values in the list:
df1 = df.groupby('A')['B'].sum()
df2 = df1.loc[['Apple', 'Orange']].reset_index()
print (df2)
        A  B
0   Apple  6
1  Orange  3
