Pandas dataframe to dictionary with specific format - python

I have a df like this:
Customer#  Facility  Transp
        1        RS       4
        2        RS       7
        3        RS       9
        1        CM       2
        2        CM       8
        3        CM       5
I want to convert to a dictionary that looks like this:
transp = {'RS': {1: 4, 2: 7, 3: 9},
          'CM': {1: 2, 2: 8, 3: 5}}
I'm unfamiliar with this conversion and have tried various options. The data has to be in exactly this dictionary format; I can't have nesting with []. Essentially, Facility is the primary level, then Customer / Transp. I feel like this should be easy... Thanks,

You can do it in one go:
import pandas as pd

df = pd.DataFrame({"Customer#": [1, 2, 3, 1, 2, 3],
                   "Facility": ['RS', 'RS', 'RS', 'CM', 'CM', 'CM'],
                   "Transp": [4, 7, 9, 2, 8, 5]})

# For each facility, turn the (Customer#, Transp) pairs into an inner dict,
# then convert the resulting Series of dicts into the outer dict.
transp = (df.groupby('Facility')[['Customer#', 'Transp']]
            .apply(lambda g: dict(g.values.tolist()))
            .to_dict())
print(transp)
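An equivalent, arguably more explicit sketch builds the same nested dict with a comprehension over the groups:

transp = {fac: dict(zip(g['Customer#'], g['Transp']))
          for fac, g in df.groupby('Facility')}

Both versions should print (groupby sorts the facility keys alphabetically):

{'CM': {1: 2, 2: 8, 3: 5}, 'RS': {1: 4, 2: 7, 3: 9}}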


After mapping data in order to convert to integer, variable only shows NaN values

So I was having trouble with a NaN error: asking for df['column'] was only showing NaN values. I've narrowed it down to this specific part of the code, and I think it has something to do with the way I have mapped the data. Does anyone have any idea?
My code is below:
df['country_code'] = df['country_code'].replace(['?'], )  # there were some '?' values, so I wanted to make these empty so that I could later replace them with the mean once I'd converted everything to integer
country_code_map = {'AUS': 1, 'USA': 2, 'CAN': 3, 'BGD': 4, 'BRZ': 5, 'JP': 6, 'ID': 7, 'HR': 8, 'CH': 9, 'FRA': 10, 'FIN': 11}
df['country_code'] = df['country_code'].map(country_code_map)
df['country_code'] = pd.to_numeric(df['country_code'])
df['country_code'] = df['country_code'].replace([''], df['country_code'].mean)
Let me know if any extra info req'd.
I've created df['country_code'] in the following way; you should have something similar:
import pandas as pd
d = {'country_code': ["?", "BRZ", "USA"]}
df = pd.DataFrame(data=d)
print(df)
Output:
  country_code
0            ?
1          BRZ
2          USA
Now if I execute your code, this is what I get:
   country_code
0           NaN
1           5.0
2           2.0
You're getting a NaN value in the output instead of a mean over the column for the following reason.
Let's take a look at this line:
df['country_code'] = df['country_code'].replace(['?'], )
print(df)
Output:
  country_code
0          NaN
1          BRZ
2          USA
Here you're not erasing the '?' and leaving the place empty; you're filling it with a NaN value.
So when you get to the last line, what you're trying to do is replace empty strings '', but what you actually have are NaNs. What you should use instead is Series.fillna, to fill the NaNs, like this:
df['country_code'] = df['country_code'].replace(['?'], )
country_code_map = {'AUS': 1, 'USA': 2, 'CAN': 3, 'BGD': 4, 'BRZ': 5, 'JP': 6, 'ID': 7, 'HR': 8, 'CH': 9, 'FRA': 10, 'FIN': 11}
df['country_code'] = df['country_code'].map(country_code_map)
df['country_code'] = pd.to_numeric(df['country_code'])
df['country_code'] = df['country_code'].fillna(df['country_code'].mean())
Output:
   country_code
0           3.5
1           5.0
2           2.0
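As a side note, a slightly more explicit variant of the first line is to turn the '?' into an actual NaN up front, so the intent is visible in the code (a small sketch using numpy):

import numpy as np
df['country_code'] = df['country_code'].replace('?', np.nan)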
So I realised the issue was in my mapping and then converting to integer: the map already converts the values to numbers automatically (anything unmapped becomes NaN), so the explicit conversion isn't needed.
Therefore the code should look like this:
country_code_map = {'AUS': 1, 'USA': 2, 'CAN': 3, 'BGD': 4, 'BRZ': 5, 'JP': 6, 'ID': 7, 'HR': 8, 'CH': 9, 'FRA': 10, 'FIN': 11}
df['country_code'] = df['country_code'].map(country_code_map)
Then I can check the mean without getting the NaN values as I was before:
df['country_code'].mean()
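For example, on the three-row sample above the mapped column is NaN, 5.0, 2.0, and .mean() skips the NaN by default, returning (5.0 + 2.0) / 2 = 3.5.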

How to find percent change by row within groups python pandas

I have a sample of my dataframe as follows:
import pandas as pd

data = {'retailer': [2, 2, 2, 2, 2, 5, 5, 5, 5, 5],
        'store': [1, 1, 1, 1, 1, 7, 7, 7, 7, 7],
        'week': [2021110701, 2021101301, 2021100601, 2021092901, 2021092201,
                 2021110701, 2021101301, 2021100601, 2021092901, 2021092201],
        'dollars': [353136.2, 379263.8, 507892.1, 491528.2, 503602.8,
                    435025.2, 406698.5, 338383.5, 360845.1, 372385.2]
        }
data = pd.DataFrame(data)
I have sorted my rows by doing
data = data.sort_values(['retailer', 'store', 'week'], ascending=(True, True, False))
I would like to find the percent difference in dollars between each row WITHIN each group: essentially, group by retailer, then store, then find the percent difference in 'dollars' between each week and the week below it, and save the value in a column next to dollars.
Basically, I'd like the output to look like:
data = {'retailer': [2, 2, 2, 2, 2, 5, 5, 5, 5, 5],
        'store': [1, 1, 1, 1, 1, 7, 7, 7, 7, 7],
        'week': [2021110701, 2021101301, 2021100601, 2021092901, 2021092201,
                 2021110701, 2021101301, 2021100601, 2021092901, 2021092201],
        'dollars': [353136.2, 379263.8, 507892.1, 491528.2, 503602.8,
                    435025.2, 406698.5, 338383.5, 360845.1, 372385.2],
        'pc_diff': [-0.06889030801252315, -0.2532591075939161, 0.03329188437204613,
                    -0.02397643539710259, float('nan'), 0.06965036753270545,
                    0.20188632128930636, -0.062247208012523876,
                    -0.030989684874694362, float('nan')]
        }
data = pd.DataFrame(data)
So for retailer 2, store 1, I'm trying to find the percent difference between week 2021110701 and 2021101301, which would be (353136.2 - 379263.8)/379263.8.
The NaNs exist because there is no row below the last week of each group, so there is nothing to find the percent change against (if that makes sense). Is there a way I can do this; is there a pandas equivalent of using a lag function?
You can use groupby + pct_change, with periods=-1 so each row is compared against the row below it:
data['pc_diff'] = data.groupby(['retailer', 'store'])['dollars'].pct_change(-1)
print(data)
Output:
   retailer  store        week   dollars   pc_diff
0         2      1  2021110701  353136.2 -0.068890
1         2      1  2021101301  379263.8 -0.253259
2         2      1  2021100601  507892.1  0.033292
3         2      1  2021092901  491528.2 -0.023976
4         2      1  2021092201  503602.8       NaN
5         5      7  2021110701  435025.2  0.069650
6         5      7  2021101301  406698.5  0.201886
7         5      7  2021100601  338383.5 -0.062247
8         5      7  2021092901  360845.1 -0.030990
9         5      7  2021092201  372385.2       NaN
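Since you asked about a lag function: the same result can be sketched explicitly with groupby + shift, where shift(-1) pulls each group's next row up as the lag:

lag = data.groupby(['retailer', 'store'])['dollars'].shift(-1)
data['pc_diff'] = (data['dollars'] - lag) / lag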

how to assign categorical values according to numbers in a column of a dataframe?

I have a data frame with a column 'score' containing scores from 1 to 10. I want to create a new column "color" that assigns each row a color according to its score.
For e.g. if the score is 1, the value of color should be "#75968f", if the score is 2, the value of color should be "#a5bab7". i.e. we need colors ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2","#f1d4Af", "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"] for scores [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] respectively.
Is it possible to do this without using a loop?
Let me know in case you have a problem understanding the question.
Use Series.map with a dictionary generated by zipping both lists, or, since the scores are just the range 1 to len(colors), build the same dictionary with enumerate:
import pandas as pd

df = pd.DataFrame({'score': [2, 4, 6, 3, 8]})
colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#f1d4Af",
          "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"]
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df['new'] = df['score'].map(dict(zip(scores, colors)))
df['new1'] = df['score'].map(dict(enumerate(colors, 1)))
print(df)
   score      new     new1
0      2  #a5bab7  #a5bab7
1      4  #e2e2e2  #e2e2e2
2      6  #dfccce  #dfccce
3      3  #c9d9d3  #c9d9d3
4      8  #cc7878  #cc7878
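Note that any score outside 1-10 would map to NaN. If that can happen in your data, one sketch of a guard (with a made-up fallback colour) is:

df['new'] = df['score'].map(dict(zip(scores, colors))).fillna('#000000')  # '#000000' is a hypothetical default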

groupby count of values not equal to other col value pandas

I'm aiming to produce a groupby count of values, but only counting rows where Item and Item2 are different. The following achieves this, but it drops groups in which no values differ. If a group has one or more rows but Item and Item2 are identical in all of them, I'm hoping to return 0.
import pandas as pd
df = pd.DataFrame({
    'Time': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4],
    'Item': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'B', 'B'],
    'Item2': ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'A'],
    'Value': [5, 6, 6, 5, 5, 6, 5, 6, 3, 1, 4, 6, 7, 4, 5, 1, 2, 3],
})
df1 = df[df['Item'] != df['Item2']].groupby(['Time']).size().reset_index(name='count')
Intended Output:
   Time  count
0     1      4
1     2      3
2     3      0
3     4      2
Edit 2:
df = pd.DataFrame({
    'Time': ['1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2', '3', '4', '4', '4'],
    'Item': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'B', 'B'],
    'Item2': ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'A'],
    'Value': [2, 6, 6, 5, 3, 3, 4, 6, 5, 1, 4, 6, 7, 4, 5, 1, 2, 3],
})
df1 = (df.assign(new=df['Item'] != df['Item2'])
         .groupby('Time')['new']
         .mean()
         .reset_index(name='avg')
       )
Intended Output:
  Time  avg
0    1  3.0
1    2  5.0
2    3  0.0
3    4  2.5
The idea is not to filter, but to count the True values per group with sum; here the Series df['Time'] is passed to groupby:
df1 = (df['Item'] != df['Item2']).groupby(df['Time']).sum().reset_index(name='count')
print(df1)
   Time  count
0     1      4
1     2      3
2     3      0
3     4      2
Another similar solution is to create a new helper column and aggregate it:
df1 = (df.assign(new=df['Item'] != df['Item2'])
         .groupby('Time')['new']
         .sum()
         .reset_index(name='count'))
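This produces the same Time/count table as the first solution.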
EDIT: You can replace the non-matching values with missing values using Series.where, and then replace the missing values with fillna:
df1 = (df.assign(new=df['Value'].where(df['Item'] != df['Item2']))
         .groupby('Time')['new']
         .mean()
         .fillna(0)
         .reset_index(name='avg')
       )
print(df1)
  Time  avg
0    1  3.0
1    2  5.0
2    3  0.0
3    4  2.5
An alternative is to use Series.reindex with the unique values of the original Time column; the reindex restores the groups that disappeared in the filtering step (here Time 3, where every row matches) and fills their avg with 0:
df1 = (df[df['Item'] != df['Item2']]
         .groupby(['Time'])['Value']
         .mean()
         .reindex(df['Time'].unique(), fill_value=0)
         .reset_index(name='avg'))
Have a look at pivot tables in pandas:
import pandas as pd

df = pd.DataFrame({
    'Time': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3],
    'Item': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A'],
    'Item2': ['B', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'A'],
    'Value': [5, 6, 6, 5, 5, 6, 5, 6, 3, 1, 4, 6, 7, 4, 5],
})
# this gives you just the rows where there is a difference
df2 = df[df['Item'] != df['Item2']]
# then count the remaining rows for each Time
pd.pivot_table(df2, index='Time', aggfunc='count')
This gives you the table:
      Item  Item2  Value
Time
1        4      4      4
2        3      3      3
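Note that, like the filtering approach in the question, this drops groups where every row matches (Time 3 never appears in the table); the reindex trick above is one way to bring such groups back with a 0.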

Groupby columns on ID and month and assign value for each month as new columns

I have a dataset where I group the monthly data with the same id:
temp1 = listvar[2].groupby(["id", "month"])["value"].mean()
This results in:
id       month
SN10380  1        -9.670370
         2        -8.303571
         3        -4.932143
         4         0.475862
         5         5.732000
                     ...
SN99950  8         6.326786
         9         4.623529
         10        1.290566
         11       -0.867273
         12       -2.485455
I then want to have each month and the corresponding value as its own column on the same ID, like this:
id       month_1    month_2    month_3  month_4  ...  month_12
SN10380  -9.670370  -8.303571  .....
SN99950
I have tried different solutions using apply(), transform() and agg(), but haven't been able to produce the wanted output.
You could use unstack. Here's the sample code:
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "month": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "value": [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
})
temp1 = df.groupby(["id", "month"])["value"].mean()
temp1.unstack()
I hope it helps!
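To get column names like month_1 ... month_12 as in the question, a small addition should do it; add_prefix renames the unstacked month columns:

temp1.unstack().add_prefix('month_')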
