I'm currently processing tweet data with the Python pandas module, and I'm stuck on a problem.
I want to make a frequency table (a pandas DataFrame) from this dictionary:
d = {"Nigeria": 9, "India": 18, "Saudi Arabia": 9, "Japan": 60, "Brazil": 3, "United States": 38, "Spain": 5, "Russia": 3, "Ukraine": 3, "Azerbaijan": 5, "China": 1, "Germany": 3, "France": 12, "Philippines": 8, "Thailand": 5, "Argentina": 9, "Indonesia": 3, "Netherlands": 8, "Turkey": 2, "Mexico": 9, "Italy": 2}
The desired output is:
>>> import pandas as pd
>>> df = pd.DataFrame(?????)
>>> df
Country Count
Nigeria 9
India 18
Saudi Arabia 9
.
.
.
(It doesn't matter if there's an index from 0 to n in the leftmost column.)
Can anyone help me with this problem?
Thanks in advance!
You really have only a single series (a column of data with index values), so this works:
pd.Series(d, name='Count')
You can then construct a DataFrame if you want:
df = pd.DataFrame(pd.Series(d, name='Count'))
df.index.name = 'Country'
Now you have:
Count
Country
Argentina 9
Azerbaijan 5
Brazil 3
...
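If you prefer Country as a regular column rather than the index (as in the asked-for output), the same Series can be reset; a small sketch using an abbreviated version of the question's dict:

```python
import pandas as pd

# Abbreviated sample of the question's dictionary
d = {"Nigeria": 9, "India": 18, "Japan": 60}

# Name the index, then promote it to a column with reset_index
df = pd.Series(d, name='Count').rename_axis('Country').reset_index()
print(df)
```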
Use the DataFrame constructor and pass the keys and values as separate columns:
df = pd.DataFrame({'Country': list(d.keys()),
                   'Count': list(d.values())}, columns=['Country', 'Count'])
print(df)
Country Count
0 Azerbaijan 5
1 Indonesia 3
2 Germany 3
3 France 12
4 Mexico 9
5 Italy 2
6 Spain 5
7 Brazil 3
8 Thailand 5
9 Argentina 9
10 Ukraine 3
11 United States 38
12 Turkey 2
13 Nigeria 9
14 Saudi Arabia 9
15 Philippines 8
16 China 1
17 Japan 60
18 Russia 3
19 India 18
20 Netherlands 8
Pass it wrapped in a list and transpose:
pd.DataFrame([d]).T.rename(columns={0:'count'})
That gets the job done but hurts performance, since we first treat the keys as columns and then transpose. Since d.items() gives us (key, value) tuples, we can instead do:
df = pd.DataFrame(list(d.items()), columns=['country', 'count'])
df.head()
country count
0 Germany 3
1 Philippines 8
2 Mexico 9
3 Nigeria 9
4 Saudi Arabia 9
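Since the goal is a frequency table, sorting by count is often the next step; a sketch with an abbreviated version of the dict:

```python
import pandas as pd

# Abbreviated sample of the question's dictionary
d = {"Nigeria": 9, "India": 18, "Japan": 60, "China": 1}

df = pd.DataFrame(list(d.items()), columns=['country', 'count'])
# Sort descending so the most frequent countries come first
df = df.sort_values('count', ascending=False).reset_index(drop=True)
print(df)
```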
Related
I have a data frame whose headers look like this:
Time Peter_Price, Peter_variable 1, Peter_variable 2, Maria_Price, Maria_variable 1, Maria_variable 3,John_price,...
2017 12 985685466 Street 1 12 4984984984 Street 2
2018 10 985785466 Street 3 78 4984974184 Street 8
2019 12 985685466 Street 1 12 4984984984 Street 2
2020 12 985685466 Street 1 12 4984984984 Street 2
2021 12 985685466 Street 1 12 4984984984 Street 2
What would be the best multi-index so I can later compare variables by group, e.g. which person has the highest variable 3, or the trend of variable 3 across all people?
I think what I need is something like this, but I'm open to other suggestions (this is my first attempt at a multi-index).
Peter Maria John
Price, variable 1, variable 2, Price, variable 1, variable 3, Price,...
Time
You can try this:
Create data
import pandas as pd
import numpy as np
import itertools
people = ["Peter", "Maria"]
vars = ["Price", "variable 1", "variable 2"]
columns = ["_".join(x) for x in itertools.product(people, vars)]
df = (pd.DataFrame(np.random.rand(10, 6), columns=columns)
        .assign(time=np.arange(2012, 2022)))
print(df.head())
Peter_Price Peter_variable 1 Peter_variable 2 Maria_Price Maria_variable 1 Maria_variable 2 time
0 0.542336 0.201243 0.616050 0.313119 0.652847 0.928497 2012
1 0.587392 0.143169 0.594997 0.553803 0.249188 0.076633 2013
2 0.447318 0.410310 0.443391 0.947064 0.476262 0.230092 2014
3 0.285560 0.018005 0.869387 0.165836 0.399670 0.307120 2015
4 0.422084 0.414453 0.626180 0.658528 0.286265 0.404369 2016
Snippet to try
new_df = df.set_index("time")
new_df.columns = new_df.columns.str.split("_", expand=True)
print(new_df.head())
Peter Maria
Price variable 1 variable 2 Price variable 1 variable 2
time
2012 0.542336 0.201243 0.616050 0.313119 0.652847 0.928497
2013 0.587392 0.143169 0.594997 0.553803 0.249188 0.076633
2014 0.447318 0.410310 0.443391 0.947064 0.476262 0.230092
2015 0.285560 0.018005 0.869387 0.165836 0.399670 0.307120
2016 0.422084 0.414453 0.626180 0.658528 0.286265 0.404369
Then you can use the xs method to sub-select specific variables for an individual-level analysis. Subsetting to only "variable 2":
>>> new_df.xs("variable 2", level=1, axis=1)
Peter Maria
time
2012 0.616050 0.928497
2013 0.594997 0.076633
2014 0.443391 0.230092
2015 0.869387 0.307120
2016 0.626180 0.404369
2017 0.443827 0.544415
2018 0.425426 0.176707
2019 0.454269 0.414625
2020 0.863477 0.322609
2021 0.902759 0.821789
Example analysis: for each year, who has the higher "Price"?
>>> new_df.xs("Price", level=1, axis=1).idxmax(axis=1)
time
2012 Peter
2013 Peter
2014 Maria
2015 Peter
2016 Maria
2017 Peter
2018 Maria
2019 Peter
2020 Maria
2021 Peter
dtype: object
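The "trend of variable 3 by people" part of the question works the same way: select the level, and each person becomes one column. A minimal sketch with made-up data (names and values are illustrative, not the question's real frame):

```python
import pandas as pd
import numpy as np

# Toy multi-index frame standing in for the question's data
rng = np.random.default_rng(0)
cols = pd.MultiIndex.from_product([["Peter", "Maria"], ["Price", "variable 3"]])
df = pd.DataFrame(rng.random((5, 4)), index=range(2017, 2022), columns=cols)

# One column per person: the trend of "variable 3" over the years
trend = df.xs("variable 3", level=1, axis=1)
print(trend.mean())    # average per person across the years
print(trend.idxmax())  # the year each person peaked
```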
Try:
df=df.set_index('Time')
df.columns = pd.MultiIndex.from_tuples([x.split('_') for x in df.columns])
Output:
Peter Maria
Price variable1 variable2 Price variable1 variable3
Time
2017 12 985685466 Street 1 12 4984984984 Street 2
2018 10 985785466 Street 3 78 4984974184 Street 8
2019 12 985685466 Street 1 12 4984984984 Street 2
2020 12 985685466 Street 1 12 4984984984 Street 2
2021 12 985685466 Street 1 12 4984984984 Street 2
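A self-contained sketch of this set_index/from_tuples approach, with a small toy frame in place of the question's data:

```python
import pandas as pd

# Toy stand-in for the question's frame
df = pd.DataFrame({'Time': [2017, 2018],
                   'Peter_Price': [12, 10],
                   'Maria_Price': [12, 78]})
df = df.set_index('Time')
# Split each "Person_variable" header into a (Person, variable) tuple
df.columns = pd.MultiIndex.from_tuples([tuple(x.split('_')) for x in df.columns])
print(df)
```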
I am currently working with this data set:
import pandas as pd
data = {'Product': ['Box', 'Bottles', 'Pen', 'Markers', 'Bottles', 'Pen', 'Markers', 'Bottles', 'Box', 'Markers', 'Markers', 'Pen'],
        'State': ['Alaska', 'California', 'Texas', 'North Carolina', 'California', 'Texas', 'Alaska', 'Texas', 'North Carolina', 'Alaska', 'California', 'Texas'],
        'Sales': [14, 24, 31, 12, 13, 7, 9, 31, 18, 16, 18, 14]}
df1 = pd.DataFrame(data, columns=['Product', 'State', 'Sales'])
df1
I want to find the 3 groups that have the highest sales
grouped_df1 = df1.groupby('State')
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False))
So I have a dataframe like this
Now, I want to find the top 3 State that have the highest sales.
I tried to use
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False)).head(3)
# It gives me the first three rows
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False)).max()
#It only gives me the maximum value
The expected result should be:
Texas: 31
California: 24
North Carolina: 18
How can I fix this? Note that a single state can hold several of the top sales values; for example, if Alaska had the three largest values, a simple sort would return Alaska three times and miss the other two groups.
Many thanks!
You could add a new column called Sales_Max_For_State and then use drop_duplicates and nlargest:
>>> df1['Sales_Max_For_State'] = df1.groupby(['State'])['Sales'].transform(max)
>>> df1
Product State Sales Sales_Max_For_State
0 Box Alaska 14 16
1 Bottles California 24 24
2 Pen Texas 31 31
3 Markers North Carolina 12 18
4 Bottles California 13 24
5 Pen Texas 7 31
6 Markers Alaska 9 16
7 Bottles Texas 31 31
8 Box North Carolina 18 18
9 Markers Alaska 16 16
10 Markers California 18 24
11 Pen Texas 14 31
>>> df2 = df1.drop_duplicates(['Sales_Max_For_State']).nlargest(3, 'Sales_Max_For_State')[['State', 'Sales_Max_For_State']]
>>> df2
State Sales_Max_For_State
2 Texas 31
1 California 24
3 North Carolina 18
I think there are a few ways to do this:
1. df1.groupby('State').agg({'Sales': 'max'}).sort_values(by='Sales', ascending=False).iloc[:3]
2. df1.groupby('State').agg({'Sales': 'max'})['Sales'].nlargest(3)
Sales
State
Texas 31
California 24
North Carolina 18
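The second variant collapses neatly to a Series; run against the question's data, it returns exactly the expected result:

```python
import pandas as pd

# The question's data
data = {'Product': ['Box', 'Bottles', 'Pen', 'Markers', 'Bottles', 'Pen',
                    'Markers', 'Bottles', 'Box', 'Markers', 'Markers', 'Pen'],
        'State': ['Alaska', 'California', 'Texas', 'North Carolina', 'California',
                  'Texas', 'Alaska', 'Texas', 'North Carolina', 'Alaska',
                  'California', 'Texas'],
        'Sales': [14, 24, 31, 12, 13, 7, 9, 31, 18, 16, 18, 14]}
df1 = pd.DataFrame(data)

# Max sale per state, then the three largest of those maxima
top3 = df1.groupby('State')['Sales'].max().nlargest(3)
print(top3)
```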
I have a list of lists like this.
sports = [['Sport', 'Country(s)'], ['Foot_ball', 'brazil'], ['Volleyball', 'Argentina', 'India'], ['Rugger', 'New_zealand', 'South_africa'], ['Cricket', 'India'], ['Carrom', 'Uk', 'Usa'], ['Chess', 'Uk']]
I want to create a pandas data frame from the lists above, as follows:
Sport       Country(s)
Foot_ball   brazil
Volleyball  Argentina
Volleyball  India
Rugger      New_zealand
Rugger      South_africa
Cricket     India
Carrom      Uk
Carrom      Usa
Chess       Uk
I tried this:
sport_x = []
for x in sports[1:]:
    sport_x.append(x[0])
print(sport_x)
country = []
for y in sports[1:]:
    country.append(y[1:])
header = sports[0]
df = pd.DataFrame([sport_x, country], columns=header)
Halfway through, I got this error:
AssertionError: 2 columns passed, passed data had 6 columns
Any suggestions on how to do this?
Something like this: first "expand" the irregularly shaped rows, then turn them into a DataFrame.
>>> sports = [
["Sport", "Country(s)"],
["Foot_ball", "brazil"],
["Volleyball", "Argentina", "India"],
["Rugger", "New_zealand", "South_africa"],
["Cricket", "India"],
["Carrom", "Uk", "Usa"],
["Chess", "Uk"],
]
>>> expanded_sports = []
>>> for row in sports:
... for country in row[1:]:
... expanded_sports.append((row[0], country))
...
>>> pd.DataFrame(expanded_sports[1:], columns=expanded_sports[0])
Sport Country(s)
0 Foot_ball brazil
1 Volleyball Argentina
2 Volleyball India
3 Rugger New_zealand
4 Rugger South_africa
5 Cricket India
6 Carrom Uk
7 Carrom Usa
8 Chess Uk
EDIT: Another solution using .melt(), though this looks uglier to me and the row order differs:
>>> pd.DataFrame(sports[1:]).melt(0, value_name='country').dropna().drop('variable', axis=1).rename({0: 'sport'}, axis=1)
sport country
0 Foot_ball brazil
1 Volleyball Argentina
2 Rugger New_zealand
3 Cricket India
4 Carrom Uk
5 Chess Uk
7 Volleyball India
8 Rugger South_africa
10 Carrom Usa
Or, the pandas way, using explode and a list comprehension:
df = pd.DataFrame([[i[0], ','.join(i[1:])] if len(i) > 2 else i for i in sports[1:]],
                  columns=sports[0])
df['Country(s)'] = df['Country(s)'].str.split(',')
final = df.explode('Country(s)').reset_index(drop=True)
Sport Country(s)
0 Foot_ball brazil
1 Volleyball Argentina
2 Volleyball India
3 Rugger New_zealand
4 Rugger South_africa
5 Cricket India
6 Carrom Uk
7 Carrom Usa
8 Chess Uk
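A slightly shorter variant of the explode approach skips the join/split round-trip by keeping the countries as a list per row:

```python
import pandas as pd

sports = [['Sport', 'Country(s)'], ['Foot_ball', 'brazil'],
          ['Volleyball', 'Argentina', 'India'],
          ['Rugger', 'New_zealand', 'South_africa'],
          ['Cricket', 'India'], ['Carrom', 'Uk', 'Usa'], ['Chess', 'Uk']]

# Keep the countries as a list per row, then explode that column
df = pd.DataFrame([[row[0], row[1:]] for row in sports[1:]], columns=sports[0])
final = df.explode('Country(s)').reset_index(drop=True)
print(final)
```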
Below is my dataframe:
Sno Name Region Num
0 1 Rubin Indore 79744001550
1 2 Rahul Delhi 89824304549
2 3 Rohit Noida 91611611478
3 4 Chirag Delhi 85879761557
4 5 Shan Bharat 95604535786
5 6 Jordi Russia 80777784005
6 7 El Russia 70008700104
7 8 Nino Spain 87707101233
8 9 Mark USA 98271377772
9 10 Pattinson Hawk Eye 87888888889
The task: retrieve the numbers from the given CSV file and store them region-wise.
delhi_list = []
for i in range(len(data)):
    if data.loc[i]['Region'] == 'Delhi':
        delhi_list.append(data.loc[i]['Num'])
I am getting the results, but I want to achieve this with a dictionary in Python. Can I do that?
IIUC, you can use groupby, apply the list aggregation then use to_dict:
data.groupby('Region')['Num'].apply(list).to_dict()
[out]
{'Bharat': [95604535786],
'Delhi': [89824304549, 85879761557],
'Hawk Eye': [87888888889],
'Indore': [79744001550],
'Noida': [91611611478],
'Russia': [80777784005, 70008700104],
'Spain': [87707101233],
'USA': [98271377772]}
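If you'd rather build the dictionary in plain Python (closer to the question's loop), dict.setdefault keeps it to a few lines; a sketch with a toy stand-in for the CSV data:

```python
import pandas as pd

# Toy stand-in for the question's CSV data
data = pd.DataFrame({'Region': ['Delhi', 'Noida', 'Delhi'],
                     'Num': [89824304549, 91611611478, 85879761557]})

# setdefault creates the list for a region the first time it appears
region_nums = {}
for region, num in zip(data['Region'], data['Num']):
    region_nums.setdefault(region, []).append(num)
print(region_nums)
```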
I'm trying to merge two DataFrames of different sizes, both indexed by 'Country'. The first dataframe, 'GDP_EN', contains every country in the world; the second, 'ScimEn', contains 15 countries.
When I try to merge these DataFrames, instead of merging the columns based on the index countries of ScimEn, I get back 'Country_x' and 'Country_y'. 'Country_x' came from GDP_EN and holds the first 15 countries in alphabetical order; 'Country_y' holds the 15 countries from ScimEn. Why didn't they merge?
I used:
DF = pd.merge(GDP_EN, ScimEn, left_index=True, right_index=True, how='right')
I think both DataFrames are not actually indexed by Country; Country is an ordinary column, so add the parameter on='Country':
GDP_EN = pd.DataFrame({'Country': ['USA', 'France', 'Slovakia', 'Russia'],
                       'a': [4, 8, 6, 9]})
print(GDP_EN)
Country a
0 USA 4
1 France 8
2 Slovakia 6
3 Russia 9
ScimEn = pd.DataFrame({'Country': ['France', 'Slovakia'],
                       'b': [80, 70]})
print(ScimEn)
Country b
0 France 80
1 Slovakia 70
DF = pd.merge(GDP_EN, ScimEn, left_index=True, right_index=True, how='right')
print(DF)
Country_x a Country_y b
0 USA 4 France 80
1 France 8 Slovakia 70
DF = pd.merge(GDP_EN, ScimEn, on='Country', how='right')
print(DF)
Country a b
0 France 8 80
1 Slovakia 6 70
If Country is the index, it works perfectly:
GDP_EN = pd.DataFrame({'Country': ['USA', 'France', 'Slovakia', 'Russia'],
                       'a': [4, 8, 6, 9]}).set_index('Country')
print(GDP_EN)
a
Country
USA 4
France 8
Slovakia 6
Russia 9
print (GDP_EN.index)
Index(['USA', 'France', 'Slovakia', 'Russia'], dtype='object', name='Country')
ScimEn = pd.DataFrame({'Country': ['France', 'Slovakia'],
                       'b': [80, 70]}).set_index('Country')
print(ScimEn)
b
Country
France 80
Slovakia 70
print (ScimEn.index)
Index(['France', 'Slovakia'], dtype='object', name='Country')
DF = pd.merge(GDP_EN, ScimEn, left_index=True, right_index=True, how='right')
print(DF)
a b
Country
France 8 80
Slovakia 6 70
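When both frames are indexed by Country, DataFrame.join is an equivalent shorthand, since it aligns on the index by default:

```python
import pandas as pd

GDP_EN = pd.DataFrame({'a': [4, 8, 6, 9]},
                      index=pd.Index(['USA', 'France', 'Slovakia', 'Russia'],
                                     name='Country'))
ScimEn = pd.DataFrame({'b': [80, 70]},
                      index=pd.Index(['France', 'Slovakia'], name='Country'))

# how='right' keeps only the countries present in ScimEn
DF = GDP_EN.join(ScimEn, how='right')
print(DF)
```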