Create dataframe from dictionary where arrays are of unequal length

I have a dictionary: {'Car': ['a', 'b'], 'Bike': ['q', 'w', 'e']}
I want to generate a data frame like this:
S.no. | vehicle | model
1 | Car | a
2 | Car | b
3 | Bike | q
4 | Bike | w
5 | Bike | e
I tried df = pd.DataFrame(vDict), but I get "ValueError: arrays must all be same length". Help please?

Use:
pd.Series(dct, name='model').explode().rename_axis(index='vehicle').reset_index()
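As a minimal, self-contained sketch of the one-liner above: the snippet below runs it on the dictionary from the question and then adds a 1-based S.no. column. The insert step is an addition for illustration and is not part of the original answer.

```python
import pandas as pd

dct = {'Car': ['a', 'b'], 'Bike': ['q', 'w', 'e']}

# One row per list element; the dictionary keys become the 'vehicle' column
df = (
    pd.Series(dct, name='model')
    .explode()
    .rename_axis(index='vehicle')
    .reset_index()
)

# Assumption: the S.no. column from the question is a plain 1-based counter
df.insert(0, 'S.no.', range(1, len(df) + 1))
print(df)
```

Because the lists are exploded row by row, the lists in the dictionary can have any length.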

We can use pd.DataFrame.from_dict here, then use stack and finally clean up our index and column names:
dct = {'Car': ['a', 'b'], 'Bike': ['q', 'w', 'e']}
df = pd.DataFrame.from_dict(dct, orient='index').stack()
df = df.reset_index(level=0, name='model').rename(columns={'level_0':'vehicle'})
df = df.reset_index(drop=True)
vehicle model
0 Car a
1 Car b
2 Bike q
3 Bike w
4 Bike e

Related

pandas join on columns which contains a list - match any

I have two dataframes.
I want to join them on a column that holds a list in one of the dataframes.
Rows should be joined if any value in the list matches.
df1 =
| index | col_1 |
| ----- | ----- |
| 1 | 'a' |
| 2 | 'b' |
df2 =
| index_2 | col_1 |
| ------- | ----- |
| A | ['a', 'c'] |
| B | ['a', 'd', 'e'] |
I am looking for something like
df1.join(df2, on='col_1', type_=any, type='left')
| index |col_1_x |index_2|col_1_y |
| ----- |--------|-------| ----- |
| 1 |'a' | A |['a', 'c'] |
| 1 |'a' | B |['a', 'd', 'e']|

You can use explode and then use merge like so:
import pandas as pd
# Create the input dataframes
df1 = pd.DataFrame({'index': [1, 2], 'col_1': ['a', 'b']})
df2 = pd.DataFrame({'index_2': ['A', 'B'], 'col_1': [['a', 'c'], ['a', 'd', 'e']]})
# Explode the list column in df2 to multiple rows
df2_exploded = df2.explode('col_1')
# Perform a regular join on the common column
result = df1.merge(df2_exploded, left_on='col_1', right_on='col_1', how='left')
# Get the "col_1" from un-exploded data
result = result.merge(df2, on='index_2', how='left').dropna()
df2_exploded looks like this:
index_2 col_1
0 A a
0 A c
1 B a
1 B d
1 B e
The final result looks like this:
index col_1_x index_2 col_1_y
0 1 a A [a, c]
1 1 a B [a, d, e]
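A possible variation, sketched under the assumption that only matching rows are wanted: copy the list column under a helper name before exploding, so the original lists survive the explode and no second merge is needed. The column name match_key is made up for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'index': [1, 2], 'col_1': ['a', 'b']})
df2 = pd.DataFrame({'index_2': ['A', 'B'], 'col_1': [['a', 'c'], ['a', 'd', 'e']]})

# 'match_key' is a hypothetical helper column: a copy of the lists that gets
# exploded, while 'col_1' keeps the original list in every exploded row
df2_keyed = df2.assign(match_key=df2['col_1']).explode('match_key')

# The inner join keeps only rows of df1 whose value occurs in some list
result = (
    df1.merge(df2_keyed, left_on='col_1', right_on='match_key',
              how='inner', suffixes=('_x', '_y'))
       .drop(columns='match_key')
)
print(result)
```

If a list contains the same value more than once, this sketch returns one row per occurrence, so such data would need de-duplicating first.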

You can do the following:
import pandas as pd
df1 = pd.DataFrame({'index': [1, 2], 'col_1': ['a', 'b']})
df2 = pd.DataFrame({'index_2': ['A', 'B'], 'col_1': [['a', 'c'], ['a', 'd', 'e']]})
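# check for matches
def any_match(list1, list2):
    if list1 is None or list2 is None:
        return False
    # list1 is a string here, so this iterates over its characters;
    # that matches the intent only for single-character values
    return any(x in list2 for x in list1)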
# join the dataframes based on matching values
result = pd.merge(df1, df2, how='cross')
result = result[result.apply(lambda x: any_match(x['col_1_x'], x['col_1_y']), axis=1)]
print(result[['index', 'col_1_x', 'index_2', 'col_1_y']])
which returns:
index col_1_x index_2 col_1_y
0 1 a A [a, c]
1 1 a B [a, d, e]

Pandas: Splitting a column by delimiter and re-arranging based on other columns

Let df be a data frame.
In [1]: import pandas as pd
...: df = pd.DataFrame(columns = ['Home', 'Score', 'Away'])
...: df.loc[0] = ['Team A', '3-1', 'Team B']
...: df.loc[1] = ['Team B', '2-1', 'Team A']
...: df.loc[2] = ['Team B', '2-2', 'Team A']
...: df.loc[3] = ['Team A', '0-1', 'Team B']
In [2]: df
Out[2]:
Home Score Away
0 Team A 3-1 Team B
1 Team B 2-1 Team A
2 Team B 2-2 Team A
3 Team A 0-1 Team B
I want to make df_1 out of df.
In [4]: df_1
Out[4]:
Team A Team B
0 3 1
1 1 2
2 2 2
3 0 1
What is the easiest way?
As a beginner, I can split the 'Score' column into two columns and then loop over the other columns to get df_1, but I guess there should be an easier way of doing that, probably with a lambda function or the groupby method.
Any ideas?

You can try this:
df["values"] = df.apply(lambda row: {row["Home"]:row["Score"].split("-")[0], row["Away"]:row["Score"].split("-")[1]}, axis=1)
output_df = pd.DataFrame(df["values"].tolist())
Output:
Team A Team B
0 3 1
1 1 2
2 2 2
3 0 1

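If there are only two teams, we can reverse the score string where needed.
Series.where works as follows: where the condition is true, it keeps the original value; otherwise, it takes the corresponding value from the second argument. Here the condition is whether the home team is 'Team A', and the replacement is the reversed score string. Note that reversing the string is only correct for single-digit scores.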
l_rev_string = lambda s: s[::-1]
df_score_rev = df.Score.apply(l_rev_string)
df1 = df.Score.where(df.Home == 'Team A', df_score_rev)\
    .str.split('-', expand=True)\
    .rename(columns={0: 'Team A', 1: 'Team B'})
| | Team A | Team B |
|---:|---------:|---------:|
| 0 | 3 | 1 |
| 1 | 1 | 2 |
| 2 | 2 | 2 |
| 3 | 0 | 1 |
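Reversing the score string breaks for scores with more than one digit (for example, '12-10' reversed is '01-21'). A minimal sketch of an alternative, assuming there are still only two teams, is to split first and swap the columns afterwards with numpy.where. The second row of the sample data is made up to show the multi-digit case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Home': ['Team A', 'Team B', 'Team B', 'Team A'],
    'Score': ['3-1', '12-10', '2-2', '0-1'],  # '12-10' is a made-up multi-digit score
    'Away': ['Team B', 'Team A', 'Team A', 'Team B'],
})

# Column 0 holds the home goals, column 1 the away goals
goals = df['Score'].str.split('-', expand=True).astype(int)
home_is_a = df['Home'].eq('Team A')

# Pick the home or away goals depending on which side 'Team A' played on
df_1 = pd.DataFrame({
    'Team A': np.where(home_is_a, goals[0], goals[1]),
    'Team B': np.where(home_is_a, goals[1], goals[0]),
})
print(df_1)
```

This also yields integer columns rather than strings, which is usually more convenient for later arithmetic.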

How to store index of duplicated rows in pandas dataframe?

My dataset looks like below:
+--------+----------+-----------+--------------------+
| | FST_NAME | LAST_NAME | EMAIL_ADDR |
+--------+----------+-----------+--------------------+
| ROW_ID | | | |
| 1-123 | Will | Smith | will.smith@abc.com |
| 1-124 | Dan | Brown | dan.brown@xyz.com |
| 1-125 | Will | Smith | will.smith@abc.com |
| 1-126 | Dan | Brown | dan.brown@xyz.com |
| 1-127 | Tom | Cruise | tom.cruise@abc.com |
| 1-128 | Will | Smith | will.smith@abc.com |
+--------+----------+-----------+--------------------+
I am trying to count duplicate rows, keeping the first record and storing the indexes of all its duplicates in a column.
I tried the code below. It gives me the count, but I am unable to group the duplicated indexes.
df.groupby(df.columns.tolist(),as_index=False).size()
How can I get the duplicated row index?

Try (this assumes the index is unnamed, so that reset_index() creates a column called "index"):
df.reset_index().groupby(df.columns.tolist())["index"].agg(list).reset_index()
To get exactly what you want:
res=df.reset_index().groupby(df.columns.tolist())["index"].agg(list).reset_index().rename(columns={"index": "duplicated"})
res.index=res["duplicated"].str[0].tolist()
res["duplicated"]=res["duplicated"].str[1:]
Outputs (dummy data):
#original df:
a b
a1 x 4
a2 y 3
b6 z 2
c7 x 4
d x 4
x y 3
#transformed one:
a b duplicated
a1 x 4 [c7, d]
a2 y 3 [x]
b6 z 2 []
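With the data from the question the index is named ROW_ID, so reset_index() produces a ROW_ID column rather than "index". The following is a sketch of the same idea adapted to that case; the e-mail addresses are sample values modeled on the question, and the count column is an illustrative addition.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'FST_NAME': ['Will', 'Dan', 'Will', 'Dan', 'Tom', 'Will'],
        'LAST_NAME': ['Smith', 'Brown', 'Smith', 'Brown', 'Cruise', 'Smith'],
        'EMAIL_ADDR': ['will.smith@abc.com', 'dan.brown@xyz.com',
                       'will.smith@abc.com', 'dan.brown@xyz.com',
                       'tom.cruise@abc.com', 'will.smith@abc.com'],
    },
    index=pd.Index(['1-123', '1-124', '1-125', '1-126', '1-127', '1-128'],
                   name='ROW_ID'),
)

cols = df.columns.tolist()

# Collect the ROW_IDs of each group; sort=False keeps first-appearance order
grouped = (
    df.reset_index()
      .groupby(cols, sort=False)['ROW_ID']
      .agg(list)
      .reset_index()
)

# The first ROW_ID becomes the index, the remaining ones are the duplicates
grouped.index = grouped['ROW_ID'].str[0]
grouped['duplicated'] = grouped['ROW_ID'].str[1:]
grouped['count'] = grouped['duplicated'].str.len()
grouped = grouped.drop(columns='ROW_ID')
print(grouped)
```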

This is not very efficient, but it works as a solution:
df2 = df.drop_duplicates()
This results in df2 =
Name1 Name2
0 Will Smith
1 Dan Brown
4 Tom Cruise
Now,
lis = []
for i in df2.iterrows():
    lis.append(i[0])
This makes lis = [0, 1, 4]. Every index from 0 to len(df) - 1 that is not in lis belongs to a duplicate row.
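The same two index lists can probably be obtained without an explicit loop by using DataFrame.duplicated, which flags every row that repeats an earlier one. A minimal sketch, assuming a default integer index and the Name1/Name2 columns shown above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name1': ['Will', 'Dan', 'Will', 'Dan', 'Tom', 'Will'],
    'Name2': ['Smith', 'Brown', 'Smith', 'Brown', 'Cruise', 'Smith'],
})

# True for every row that repeats an earlier row (first occurrences are False)
is_dup = df.duplicated()

first_idx = df.index[~is_dup].tolist()  # same values as lis above
dup_idx = df.index[is_dup].tolist()     # indexes of the duplicate rows
print(first_idx, dup_idx)
```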

For df like:
FST_NAME L_NAME email
0 w s ws
1 d b db
2 w s ws
3 z z zz
Group the index values into lists:
import pandas as pd
df = pd.DataFrame({'FST_NAME': ['w', 'd', 'w', 'z'], 'L_NAME': ['s', 'b', 's', 'z'], 'email': ['ws', 'db', 'ws', 'zz']})
df = df.groupby(df.columns.tolist()).apply(lambda row: pd.Series({'duplicated': list(row.index)}))
Output:
duplicated
FST_NAME L_NAME email
d b db [1]
w s ws [0, 2]
z z zz [3]

Plot in python after crosstab merge

I'd like to plot my DataFrame. I had this DF first:
id|project|categories|rating
1 | a | A | 1
1 | a | B | 1
1 | a | C | 2
1 | b | A | 1
1 | b | B | 1
2 | c | A | 1
2 | c | B | 2
used this code:
import pandas as pd
df = pd.DataFrame(...)
(df.groupby('id').project.nunique().reset_index()
.merge(pd.crosstab(df.id, df.categories).reset_index()))
and now got this DataFrame:
id | project | A | B | C |
1 | 2 | 2 | 2 | 1 |
2 | 1 | 1 | 1 | 0 |
Now I'd like to plot this DataFrame. I want to show whether the number of projects depends on how many categories are affected, or on which categories are affected. I know how to visualize dataframes, but after the crosstab and merge it is not working as usual.

I reproduced your data using below code:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2],
                   'project': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
                   'categories': ['A', 'B', 'C', 'A', 'B', 'A', 'B'],
                   'rating': [1, 1, 2, 1, 1, 1, 2]})
Now data looks like this
categories id project rating
0 A 1 a 1
1 B 1 a 1
2 C 1 a 2
3 A 1 b 1
4 B 1 b 1
5 A 2 c 1
6 B 2 c 2
If you want to plot 'category count' as a function of 'project count' it looks like this.
import matplotlib.pyplot as plt
# this line is your code
df2 = df.groupby('id').project.nunique().reset_index().merge(pd.crosstab(df.id, df.categories).reset_index())
plt.scatter(df2.project, df2.A, label='A', alpha=0.5)
plt.scatter(df2.project, df2.B, label='B', alpha=0.5)
plt.scatter(df2.project, df2.C, label='C', alpha=0.5)
plt.xlabel('project count')
plt.ylabel('category count')
plt.legend()
plt.show()
This produces a scatter plot of category count against project count, with one set of points per category.

Rolling up data frame along with count of rows in python

I am still learning Python and want to know how to roll up the data and count the duplicate rows in a column called Count.
The data frame structure is as follows:
Col1| Value
A | 1
B | 1
A | 1
B | 1
C | 3
C | 3
C | 3
C | 3
My result should be as follows
Col1|Value|Count
A | 1 | 2
B | 1 | 2
C | 3 | 4

>>> df2 = df.groupby(['Col1', 'Value']).size().reset_index()
>>> df2.columns = ['Col1', 'Value', 'Count']
>>> df2
Col1 Value Count
0 A 1 2
1 B 1 2
2 C 3 4
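The two steps above can likely be combined by naming the count column directly in reset_index. A minimal sketch on the data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A', 'B', 'A', 'B', 'C', 'C', 'C', 'C'],
    'Value': [1, 1, 1, 1, 3, 3, 3, 3],
})

# size() counts the rows per (Col1, Value) group; name= labels the new column
df2 = df.groupby(['Col1', 'Value']).size().reset_index(name='Count')
print(df2)
```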

Roman Pekar's fine answer is correct for this case. However, I saw it after trying to write a solution for the general case stated in the text of your question, not just the example with specific column names. So, for the general case, consider:
df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
For example:
import pandas as pd
df = pd.DataFrame({'Col1': ['a', 'a', 'a', 'b', 'c'], 'Value': [1, 2, 1, 3, 2]})
>>> df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
Col1 Value Count
0 a 1 2
1 a 2 1
2 b 3 1
3 c 2 1

You can also try:
df.groupby('Col1')['Value'].value_counts().reset_index(name='Count')
