Keep order while using pandas pivot

df = {'Region':['France','France','France','France'],'total':[1,2,3,4],'date':['12/30/19','12/31/19','01/01/20','01/02/20']}
df=pd.DataFrame.from_dict(df)
print(df)
   Region  total      date
0  France      1  12/30/19
1  France      2  12/31/19
2  France      3  01/01/20
3  France      4  01/02/20
The dates are ordered. Now if I use pivot:
pandas_temp = df.pivot(index='Region', values='total', columns='date')
print(pandas_temp)
date    01/01/20  01/02/20  12/30/19  12/31/19
Region
France         3         4         1         2
I am losing the order. How can I keep it?

Convert the values to datetimes before the pivot, then if necessary convert the columns back to your custom format:
df['date'] = pd.to_datetime(df['date'])
pandas_temp = df.pivot(index='Region', values='total', columns='date')
pandas_temp = pandas_temp.rename(columns=lambda x: x.strftime('%m/%d/%y'))
#alternative
#pandas_temp.columns = pandas_temp.columns.strftime('%m/%d/%y')
print (pandas_temp)
date    12/30/19  12/31/19  01/01/20  01/02/20
Region
France         1         2         3         4
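If you'd rather keep the date strings untouched, an alternative sketch (assuming the rows already appear in the desired order) is to make date an ordered Categorical before pivoting, so the pivoted columns follow the original order instead of lexicographic order:
import pandas as pd

df = pd.DataFrame({'Region': ['France'] * 4,
                   'total': [1, 2, 3, 4],
                   'date': ['12/30/19', '12/31/19', '01/01/20', '01/02/20']})

# An ordered Categorical makes pivot sort the columns by category order,
# i.e. by first appearance, rather than alphabetically.
df['date'] = pd.Categorical(df['date'], categories=df['date'].unique(), ordered=True)
pandas_temp = df.pivot(index='Region', values='total', columns='date')
print(pandas_temp)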

Related

Python: Remodelling a dataframe and regrouping data from a specific column with predefined rows

Let's say that I have this dataframe with four columns: "Name", "Value", "Ccy" and "Group":
import pandas as pd
Name = ['ID', 'Country', 'IBAN','Dan_Age', 'Dan_city', 'Dan_country', 'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country' ]
Value = ['TAMARA_CO', 'GERMANY','FR56','18', 'Berlin', 'GER', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP']
Ccy = ['','','','EUR','EUR','USD','USD','','CHF', '','DKN','']
Group = ['0','0','0','1','1','1','1','2','2','2','3','3']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
           Name      Value  Ccy Group
0            ID  TAMARA_CO          0
1       Country    GERMANY          0
2          IBAN       FR56          0
3       Dan_Age         18  EUR     1
4      Dan_city     Berlin  EUR     1
5   Dan_country        GER  USD     1
6       Dan_sex          M  USD     1
7       Dan_Age         22          2
8   Dan_country        FRA  CHF     2
9       Dan_sex          M          2
10     Dan_city     Madrid  DKN     3
11  Dan_country        ESP          3
I want to represent this data differently before saving it to a CSV. I would like to group the duplicates in the column "Name" with their associated values in "Value" and "Ccy", and store the data from "Value" and "Ccy" in the row (index) defined by the column "Group", so that the data are not mixed.
Then, if a name is in group 0, it is general data, so I would like all the rows for this "Name" to be filled with the same value.
So I would like to get this result :
  ID_Value Country_Value IBAN_Value Dan_Age_Value Dan_Age_Ccy Dan_city_Value Dan_city_Ccy Dan_sex_Value
1   TAMARA           GER       FR56            18         EUR         Berlin          EUR             M
2   TAMARA           GER       FR56            22                                                    M
3   TAMARA           GER       FR56                                  Madrid          DKN
I cannot figure out how to do the first part. With the code below, I do not get what I want, even if I remove the empty columns:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
Can anyone help me? Thank you!
You can use the following. See comments in code for each step:
import numpy as np

s = df.loc[df['Group'] == '0', 'Name'].tolist()  # this variable will be used later according to Condition 2
df['Name'] = pd.Categorical(df['Name'], categories=df['Name'].unique(), ordered=True)  # this preserves order before pivoting
df = df.pivot(index='Group', columns='Name')  # transforms long-to-wide per expected output
for col in df.columns:
    if col[1] in s:
        df[col] = df[col].shift().ffill()  # Condition 2
df = df.iloc[1:].replace('', np.nan).dropna(axis=1, how='all').fillna('')  # dataframe cleanup
df.columns = ['_'.join(col) for col in df.columns.swaplevel()]  # column name cleanup
df
Out[1]:
        ID_Value Country_Value IBAN_Value Dan_Age_Value Dan_city_Value  \
Group
1      TAMARA_CO       GERMANY       FR56            18         Berlin
2      TAMARA_CO       GERMANY       FR56            22
3      TAMARA_CO       GERMANY       FR56                       Madrid

      Dan_country_Value Dan_sex_Value Dan_Age_Ccy Dan_city_Ccy  \
Group
1                   GER             M         EUR          EUR
2                   FRA             M
3                   ESP                                    DKN

      Dan_country_Ccy Dan_sex_Ccy
Group
1                 USD         USD
2                 CHF
3
From there, you can drop columns you don't want, change strings from "TAMARA_CO" to "TAMARA", "GERMANY" to "GER", use reset_index(drop=True), etc.
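For instance, a hypothetical cleanup along those lines (the column names and replacements below are read off the expected output, so adjust as needed) could be:
# Hypothetical cleanup sketch: drop the columns absent from the expected
# output, shorten the repeated values, and reset the index.
df = df.drop(columns=['Dan_country_Value', 'Dan_country_Ccy', 'Dan_sex_Ccy'])
df = df.replace({'TAMARA_CO': 'TAMARA', 'GERMANY': 'GER'}).reset_index(drop=True)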
You can do this quite easily with only 3 steps:
1. Split your data frame into 2 parts: the "general data" (which we want as a series) and the more specific data. Each data frame now contains the same kinds of information.
2. The key part of your problem: reorganizing the data. All you need is the pandas pivot function. It does exactly what you need!
3. Add the general information and the pivoted data back together.
# Split Data
general = df[df.Group == "0"].set_index("Name")["Value"].copy()
main_df = df[df.Group != "0"]

# Pivot Data
result = main_df.pivot(index="Group", columns=["Name"],
                       values=["Value", "Ccy"]).fillna("")
result.columns = [f"{c[1]}_{c[0]}" for c in result.columns]

# Create a data frame that has an identical row for each group
general_df = pd.DataFrame([general] * len(result), index=result.index)
general_df.columns = [c + "_Value" for c in general_df.columns]

# Merge the data back together
result = general_df.merge(result, on="Group")
The result given above does not give the exact column order you want, so you'd have to specify that manually with:
final_cols = ["ID_Value", "Country_Value", "IBAN_Value",
              "Dan_Age_Value", "Dan_Age_Ccy", "Dan_city_Value",
              "Dan_city_Ccy", "Dan_sex_Value"]
result = result[final_cols]

Replace the values in a column based on frequency

I have a dataframe (3.7 million rows) with a column of different country names:
id  Country
1   RUSSIA
2   USA
3   RUSSIA
4   RUSSIA
5   INDIA
6   USA
7   USA
8   ITALY
9   USA
10  RUSSIA
I want to replace INDIA and ITALY with "Miscellaneous" because they occur less than 15% of the time in the column.
My alternative solution is to replace the names with their frequency using:
df.column_name = df.column_name.map(df.column_name.value_counts())
Use:
df.loc[df.groupby('Country')['id']
         .transform('size')
         .div(len(df))
         .lt(0.15),
       'Country'] = 'Miscellaneous'
Or:
df.loc[df['Country'].map(df['Country'].value_counts(normalize=True)
                           .lt(0.15)),
       'Country'] = 'Miscellaneous'
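For reference, a self-contained run of the second option on the sample data (reconstructed here, so treat the frame construction as an assumption):
import pandas as pd

df = pd.DataFrame({'id': range(1, 11),
                   'Country': ['RUSSIA', 'USA', 'RUSSIA', 'RUSSIA', 'INDIA',
                               'USA', 'USA', 'ITALY', 'USA', 'RUSSIA']})

# INDIA and ITALY each account for 10% of the rows, below the 15% cutoff,
# so only those two labels get replaced.
df.loc[df['Country'].map(df['Country'].value_counts(normalize=True).lt(0.15)),
       'Country'] = 'Miscellaneous'
print(df['Country'].tolist())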
If you want to put every country whose frequency is below a threshold into the "Misc" category:
threshold = 0.15
freq = df['Country'].value_counts(normalize=True)
mappings = freq.index.to_series().mask(freq < threshold, 'Misc').to_dict()
df['Country'] = df['Country'].map(mappings)
Here is another option:
s = df['Country'].value_counts(normalize=True)
low = s[s < 0.15].index.tolist()
df['Country'] = df['Country'].replace(low, 'Miscellaneous')
You can use a dictionary and map for this:
d = df.Country.value_counts(normalize=True).to_dict()
df.Country.map(lambda x: x if d[x] > 0.15 else 'Miscellaneous')
Output:
id
1            RUSSIA
2               USA
3            RUSSIA
4            RUSSIA
5     Miscellaneous
6               USA
7               USA
8     Miscellaneous
9               USA
10           RUSSIA
Name: Country, dtype: object
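Note that on a 3.7-million-row column, the vectorized options above (groupby/transform or a precomputed value_counts map) should be noticeably faster than calling a Python-level lambda per element.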

Count how many values occur in a month in pandas

I have a dataframe as follows:
            Country
From Date
02/04/2020   Canada
04/02/2020  Ireland
10/03/2020   France
11/03/2020    Italy
15/03/2020  Hungary
...
10/10/2020   Canada
And I simply want to do a groupby() or something similar which will tell me how many times a country occurs per month, e.g.:
         Canada  Ireland  France  ...
2020  1       3        4       1
      2       4        3       2
      ...
      10      4        4       4
Is there a simple way to do this?
Any help much appreciated!
A different angle to solve your question would be to use groupby, value_counts and unstack.
It goes like this:
I assume your "From Date" is of type date (datetime64[ns]); if not, convert it first:
df['From Date'] = pd.to_datetime(df['From Date'], format='%d/%m/%Y')
Convert the date to a string with year + month:
df['From Date'] = df['From Date'].dt.strftime('%Y-%m')
Group by From Date and count the values:
df.groupby(['From Date'])['Country'].value_counts().unstack().fillna(0).astype(int)
Desired result (from the snapshot in your question):
Country    Canada  France  Hungary  Ireland  Italy
From Date
2020-02         0       0        0        1      0
2020-03         0       1        1        0      1
2020-04         1       0        0        0      0
Note the unstack, which places the countries on the horizontal axis; astype(int), which avoids values such as 1.0; and fillna(0), so that a country with no occurrences in a month shows zero.
Check with crosstab
# df.index=pd.to_datetime(df.index, format= '%d/%m/%Y')
pd.crosstab(df.index.strftime('%Y-%m'), df['Country'])
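A variation on the same idea that avoids string formatting, sketched here on data reconstructed from the question (so the frame construction is an assumption), is to bucket the DatetimeIndex into monthly periods:
import pandas as pd

df = pd.DataFrame({'Country': ['Canada', 'Ireland', 'France', 'Italy', 'Hungary']},
                  index=pd.to_datetime(['02/04/2020', '04/02/2020', '10/03/2020',
                                        '11/03/2020', '15/03/2020'], format='%d/%m/%Y'))
df.index.name = 'From Date'

# to_period('M') maps each date to its calendar month, e.g. 2020-03.
print(pd.crosstab(df.index.to_period('M'), df['Country']))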

Populating new column in dataframe from existing two with different lengths

How would I go about adding a new column to an existing dataframe by comparing it to another that is shorter in length and has a different index?
For example, if I have:
df1 =
   country  code  year
0  Armenia     a  2016
1   Brazil     b  2017
2   Turkey     c  2016
3  Armenia     d  2017
df2 =
  geoCountry  2016_gdp  2017_gdp
0    Armenia    10.499     10.74
1     Brazil  1,798.62  2,140.94
2     Turkey   857.429   793.698
and I want to end up with:
df1 =
   country  code  year       gdp
0  Armenia     a  2016    10.499
1   Brazil     b  2017  2,140.94
2   Turkey     c  2016   857.429
3  Armenia     d  2017     10.74
How would I go about this? I attempted to use answers outlined here and here, to no avail. I also tried the following, which takes too long on a 90,000-row dataframe:
for index, row in df1.iterrows():
    if row['country'] in list(df2.geoCountry):
        if row['year'] == 2016:
            df1['gdp'].append(df2[df2.geoCountry == str(row['country'])]['2016'])
        else:
            df1['gdp'].append(df2[df2.geoCountry == str(row['country'])]['2017'])
I guess this is what you're looking for:
df2 = df2.melt(id_vars='geoCountry', value_vars=['2016_gdp', '2017_gdp'], var_name='year')
df1['year'] = df1['year'].astype('int')
df2['year'] = df2['year'].str.slice(0, 4).astype('int')
df1.merge(df2, left_on=['country', 'year'], right_on=['geoCountry', 'year'])[['country', 'code', 'year', 'value']]
Output:
   country  code  year     value
0  Armenia     a  2016    10.499
1   Brazil     b  2017  2,140.94
2   Turkey     c  2016   857.429
3  Armenia     d  2017     10.74
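One caveat: the GDP figures may be strings if they contain thousands separators, as the display of 1,798.62 suggests. Assuming the merged frame above is assigned to a variable named merged (a hypothetical name), a minimal cleanup sketch is:
# Hypothetical: strip commas and convert, e.g. '1,798.62' -> 1798.62
merged['value'] = merged['value'].str.replace(',', '').astype(float)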
You mainly need the melt function:
df2.columns = df2.columns.str.split("_").str.get(0)
df2 = df2.rename(columns={"geoCountry": "country"})
df3 = pd.melt(df2, id_vars=['country'], value_vars=['2016', '2017'],
              var_name='year', value_name='gdp')
df3['year'] = df3['year'].astype(int)
After this you simply merge df1 with the above df3:
result = pd.merge(df1, df3, on=['country', 'year'])
Output:
   country  code  year       gdp
0  Armenia     a  2016    10.499
1   Brazil     b  2017  2140.940
2   Turkey     c  2016   857.429
3  Armenia     d  2017    10.740
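If you prefer to skip the merge entirely, a minimal merge-free sketch (assuming the sample frames from the question, with integer years in df1) builds a (country, year) -> gdp dictionary and maps it row by row:
# Reshape df2 into a {(country, year): gdp} dictionary.
lookup = (df2.set_index('geoCountry')
             .rename(columns=lambda c: int(c.split('_')[0]))  # '2016_gdp' -> 2016
             .stack()
             .to_dict())
df1['gdp'] = [lookup.get((c, y)) for c, y in zip(df1['country'], df1['year'])]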

Pandas merge fail to extract common Index values

I'm trying to merge two DataFrames of different sizes, both indexed by 'Country'. The first dataframe 'GDP_EN' contains every country in the world, and the second dataframe 'ScimEn' contains 15 countries.
When I try to merge these DataFrames, instead of merging the columns based on the index countries of ScimEn, I get back 'Country_x' and 'Country_y'. 'Country_x' came from GDP_EN and holds the first 15 countries in alphabetical order. 'Country_y' holds the 15 countries from ScimEn. I'm wondering why they didn't merge?
I used:
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')
I think both DataFrames are not indexed by Country; Country is an ordinary column, so add the parameter on='Country':
GDP_EN = pd.DataFrame({'Country': ['USA', 'France', 'Slovakia', 'Russia'],
                       'a': [4, 8, 6, 9]})
print (GDP_EN)
    Country  a
0       USA  4
1    France  8
2  Slovakia  6
3    Russia  9
ScimEn = pd.DataFrame({'Country': ['France', 'Slovakia'],
                       'b': [80, 70]})
print (ScimEn)
    Country   b
0    France  80
1  Slovakia  70
DF = pd.merge(GDP_EN, ScimEn, left_index=True, right_index=True, how='right')
print (DF)
  Country_x  a Country_y   b
0       USA  4    France  80
1    France  8  Slovakia  70
DF = pd.merge(GDP_EN, ScimEn, on='Country', how='right')
print (DF)
    Country  a   b
0    France  8  80
1  Slovakia  6  70
If Country is the index in both, it works perfectly:
GDP_EN = pd.DataFrame({'Country': ['USA', 'France', 'Slovakia', 'Russia'],
                       'a': [4, 8, 6, 9]}).set_index('Country')
print (GDP_EN)
          a
Country
USA       4
France    8
Slovakia  6
Russia    9
print (GDP_EN.index)
Index(['USA', 'France', 'Slovakia', 'Russia'], dtype='object', name='Country')
ScimEn = pd.DataFrame({'Country': ['France', 'Slovakia'],
                       'b': [80, 70]}).set_index('Country')
print (ScimEn)
           b
Country
France    80
Slovakia  70
print (ScimEn.index)
Index(['France', 'Slovakia'], dtype='object', name='Country')
DF = pd.merge(GDP_EN, ScimEn, left_index=True, right_index=True, how='right')
print (DF)
          a   b
Country
France    8  80
Slovakia  6  70
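As a side note, once both frames are indexed by Country, DataFrame.join is a shorthand sketch for the same index-based merge:
DF = GDP_EN.join(ScimEn, how='right')  # joins on the shared Country index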
