Create stacked pandas series from series with list elements - python

I have a pandas Series whose elements are lists:
import pandas as pd
s = pd.Series([ ['United States of America'],['China', 'Hong Kong'], []])
print(s)
0 [United States of America]
1 [China, Hong Kong]
2 []
How can I get a series like the following?
0 United States of America
1 China
1 Hong Kong
I am not sure what should happen to row 2 (the empty list).

The following options all return a Series. Create a new frame from the lists and stack it.
pd.DataFrame(s.tolist()).stack()
0 0 United States of America
1 0 China
1 Hong Kong
dtype: object
To reset the index, use
pd.DataFrame(s.tolist()).stack().reset_index(drop=True)
0 United States of America
1 China
2 Hong Kong
dtype: object
To convert to DataFrame, call to_frame()
pd.DataFrame(s.tolist()).stack().reset_index(drop=True).to_frame('countries')
countries
0 United States of America
1 China
2 Hong Kong
If you're trying to code golf, use
sum(s, [])
# ['United States of America', 'China', 'Hong Kong']
pd.Series(sum(s, []))
0 United States of America
1 China
2 Hong Kong
dtype: object
Or even (with numpy imported as np),
pd.Series(np.sum(s))
0 United States of America
1 China
2 Hong Kong
dtype: object
However, as with most operations that sum lists, this performs poorly: each concatenation copies the list accumulated so far, so the cost grows quadratically with the total number of elements.
A faster alternative is itertools.chain:
from itertools import chain
pd.Series(list(chain.from_iterable(s)))
0 United States of America
1 China
2 Hong Kong
dtype: object
pd.DataFrame(list(chain.from_iterable(s)), columns=['countries'])
countries
0 United States of America
1 China
2 Hong Kong
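
The flattened results above discard the original index, whereas the question asks for the labels 0, 1, 1. A minimal sketch of one way to keep them, assuming every element of s is a list: repeat each index label once per item in its list. The names lengths and flat are illustrative only.

```python
from itertools import chain

import pandas as pd

s = pd.Series([['United States of America'], ['China', 'Hong Kong'], []])

# Number of items in each list; the empty list contributes 0 rows.
lengths = [len(items) for items in s]

# Repeat each original label once per item, then flatten the values.
flat = pd.Series(
    list(chain.from_iterable(s)),
    index=s.index.repeat(lengths),
)
print(flat)
```

With this approach the row holding the empty list simply disappears from the result.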

Or, for this specific input (item() requires exactly one non-null value in the second column), use:
df = pd.DataFrame(s.tolist())
print(df[0].fillna(df[1].dropna().item()))
Output:
0 United States of America
1 China
2 Hong Kong
Name: 0, dtype: object

Assuming the elements are lists:
pd.Series(s.sum())
Out[103]:
0 United States of America
1 China
2 Hong Kong
dtype: object

A simpler and probably much less computationally expensive way is the pandas explode method (available since pandas 0.25). In your case, the answer would be:
s.explode()
Simple as that! For a DataFrame with several columns, specify which column to explode by passing its name as a string, for example df.explode('country').
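
As a small sketch of what explode does with the empty list from the question: the original index labels are kept, and the empty list becomes a NaN row that can be dropped if it is not wanted. The names exploded and stacked are illustrative only.

```python
import pandas as pd

s = pd.Series([['United States of America'], ['China', 'Hong Kong'], []])

# Each list item gets its own row; the empty list becomes NaN at label 2.
exploded = s.explode()

# Drop the NaN row to keep only the real values (labels 0, 1, 1).
stacked = exploded.dropna()
print(stacked)
```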

Related

How to maintain the same index after sorting a Pandas series?

I have the following Pandas series from the dataframe 'Reducedset':
Reducedset = Top15.iloc[:,10:20].mean(axis=1).sort_values(ascending=False)
Which gives me:
Country
United States 1.536434e+13
China 6.348609e+12
Japan 5.542208e+12
Germany 3.493025e+12
France 2.681725e+12
United Kingdom 2.487907e+12
Brazil 2.189794e+12
Italy 2.120175e+12
India 1.769297e+12
Canada 1.660647e+12
Russian Federation 1.565459e+12
Spain 1.418078e+12
Australia 1.164043e+12
South Korea 1.106715e+12
Iran 4.441558e+11
dtype: float64
I want to update the index, so that index of the dataframe Reducedset is in the same order as the series above.
How can I do this?
In other words, when I then look at the entire dataframe, the index order should be the same as in the series above and not like that below:
Reducedset
Rank Documents Citable documents Citations \
Country
China 1 127050 126767 597237
United States 2 96661 94747 792274
Japan 3 30504 30287 223024
United Kingdom 4 20944 20357 206091
Russian Federation 5 18534 18301 34266
Canada 6 17899 17620 215003
Germany 7 17027 16831 140566
India 8 15005 14841 128763
France 9 13153 12973 130632
South Korea 10 11983 11923 114675
Italy 11 10964 10794 111850
Spain 12 9428 9330 123336
Iran 13 8896 8819 57470
Australia 14 8831 8725 90765
Brazil 15 8668 8596 60702
The answer:
Reducedset = Top15.iloc[:,10:20].mean(axis=1).sort_values(ascending=False)
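This first stage computes the row-wise mean (axis=1) of the columns at positions 10 through 19 (the slice 10:20 excludes its end point) and sorts the result in descending order (ascending=False).
Top15 = Top15.reindex(Reducedset.index)
Here, we reorder the rows of the original dataframe Top15 so that they follow the index of the sorted series above. reindex returns a new object rather than modifying in place, so the result is assigned back.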
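
As a minimal sketch of reordering a frame by a sorted series, assuming a small made-up frame in place of Top15 (the names frame, means and reordered, and the numbers, are hypothetical): reindex returns a new object whose rows follow the labels it is given.

```python
import pandas as pd

# Hypothetical stand-in for Top15: two yearly columns indexed by country.
frame = pd.DataFrame(
    {'2006': [1.0, 9.0, 5.0], '2007': [3.0, 11.0, 7.0]},
    index=pd.Index(['China', 'United States', 'Japan'], name='Country'),
)

# Row-wise mean, sorted from largest to smallest.
means = frame.mean(axis=1).sort_values(ascending=False)

# Reorder the frame's rows to follow the sorted series; frame is unchanged.
reordered = frame.reindex(means.index)
print(reordered)
```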

String manipulation based on the pattern of two columns, any convenient way?

d = {'country': ['US', 'US', 'United Kingdom', 'United Kingdom'],
'province/state': ['New York', np.nan, 'Gibraltar', np.nan]}
df = pd.DataFrame(data=d)
I guess there are three steps:
Step 1: fill the NA of the province with the related country
df['province/state'].fillna(df['country'], inplace=True)
Step 2: create a new col by concatenating the country and province with '-':
df['new_geo'] = df['country'] + '-' + df['province/state']
Step 3: remove the country if it is repeated:
for example, remove United Kingdom-United Kingdom. Only keep those which are not overlapped, such as United Kingdom-Gibraltar. But I am not sure what regex should be used.
Is there any convenient way to do this?
Try:
df['new_geo'] = np.where(df['province/state'].notna(), df['country'] + '-' + df['province/state'], df['country'])
df['province/state']=df['province/state'].fillna(df['country'])
Outputs:
country province/state new_geo
0 US New York US-New York
1 US US US
2 United Kingdom Gibraltar United Kingdom-Gibraltar
3 United Kingdom United Kingdom United Kingdom
Combine the strings using pandas str.cat, then fill the empty cells sideways using ffill with axis=1.
res = (df
.assign(new_geo = lambda x: x.country.str.cat(x['province/state'],sep='-'))
.ffill(axis=1)
)
res
country province/state new_geo
0 US New York US-New York
1 US US US
2 United Kingdom Gibraltar United Kingdom-Gibraltar
3 United Kingdom United Kingdom United Kingdom
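
Another possible variant, sketched here under the assumption that missing provinces are stored as NaN: str.cat leaves NaN wherever the province is missing, so fillna can fall back to the bare country and no regex is needed. The name joined is illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'country': ['US', 'US', 'United Kingdom', 'United Kingdom'],
    'province/state': ['New York', np.nan, 'Gibraltar', np.nan],
})

# str.cat yields NaN wherever the province is missing ...
joined = df['country'].str.cat(df['province/state'], sep='-')

# ... so those rows fall back to the bare country name.
df['new_geo'] = joined.fillna(df['country'])
print(df)
```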

Pandas - value_counts on multiple values in one cell

I have a dataframe which has a column with multiple values, separated by a comma like this:
Country
Australia, Cuba, Argentina
Australia
United States, Canada, United Kingdom, Argentina
I would like to count each unique value, similar to value_counts, like this:
Australia: 2
Cuba: 1
Argentina: 2
United States: 1
My simplest method is shown below, but I suspect that this can be done more efficiently and neatly.
from collections import Counter
Counter(pd.DataFrame(data['Country'].str.split(',', expand=True)).values.ravel())
Cheers
You can use get_dummies:
df.Country.str.get_dummies(sep=', ').sum()
Out[354]:
Argentina 2
Australia 2
Canada 1
Cuba 1
United Kingdom 1
United States 1
dtype: int64
Another option is to split and then use value_counts
pd.Series(df.Country.str.split(', ').sum()).value_counts()
Argentina 2
Australia 2
United Kingdom 1
Canada 1
Cuba 1
United States 1
dtype: int64
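
A further option, sketched under the assumption that pandas 0.25 or later is available: split each cell into a list, explode the lists into one row per name, and count. The strip step removes the space left after each comma; the name counts is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Country': [
    'Australia, Cuba, Argentina',
    'Australia',
    'United States, Canada, United Kingdom, Argentina',
]})

counts = (
    df['Country']
    .str.split(',')   # one list of names per row
    .explode()        # one name per row
    .str.strip()      # remove the space left after each comma
    .value_counts()
)
print(counts)
```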

how to iterate by loop with values in function using python?

I want to pass values one by one to a function using a loop in Python. The values are stored in a dataframe.
def eam(A, B):
    y = A + " " + B
    return y
Suppose I pass the values of A as country and B as capital.
Dataframe df is
country capital
India New Delhi
Indonesia Jakarta
Islamic Republic of Iran Tehran
Iraq Baghdad
Ireland Dublin
How can I get these values using a loop?
0 India New Delhi
1 Indonesia Jakarta
2 Islamic Republic of Iran Tehran
3 Iraq Baghdad
4 Ireland Dublin
Here you go: just use the following syntax to get a new column in the dataframe. There is no need to write code to loop over the rows. However, if you must loop, df.iterrows() or df.itertuples() provide nice functionality to accomplish similar objectives.
>>> df = pd.read_clipboard(sep='\t')
>>> df.head()
country capital
0 India New Delhi
1 Indonesia Jakarta
2 Islamic Republic of Iran Tehran
3 Iraq Baghdad
4 Ireland Dublin
>>> df.columns
Index(['country', 'capital'], dtype='object')
>>> df['both'] = df['country'] + " " + df['capital']
>>> df.head()
country capital both
0 India New Delhi India New Delhi
1 Indonesia Jakarta Indonesia Jakarta
2 Islamic Republic of Iran Tehran Islamic Republic of Iran Tehran
3 Iraq Baghdad Iraq Baghdad
4 Ireland Dublin Ireland Dublin
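
If an explicit loop is really needed, a minimal sketch with itertuples could look like the following. It assumes the column names are valid Python identifiers so they can be read as attributes of each row; the shortened dataframe and the name results are illustrative only.

```python
import pandas as pd

def eam(A, B):
    return A + " " + B

df = pd.DataFrame({
    'country': ['India', 'Indonesia', 'Ireland'],
    'capital': ['New Delhi', 'Jakarta', 'Dublin'],
})

# Call the function once per row, passing the two column values.
results = []
for row in df.itertuples(index=False):
    results.append(eam(row.country, row.capital))
print(results)
```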

Weird behaviour with pandas cut, groupby and multiindex in Python

I have a dataframe like this one,
Continent % Renewable
Country
China Asia 2
United States North America 1
Japan Asia 1
United Kingdom Europe 1
Russian Federation Europe 2
Canada North America 5
Germany Europe 2
India Asia 1
France Europe 2
South Korea Asia 1
Italy Europe 3
Spain Europe 3
Iran Asia 1
Australia Australia 1
Brazil South America 5
where % Renewable is a column created using the cut function,
Top15['% Renewable'] = pd.cut(Top15['% Renewable'], 5, labels=range(1,6))
when I group by Continent and % Renewable to count the number of countries in each subset I do,
count_groups = Top15.groupby(['Continent', '% Renewable']).size()
which is,
Continent % Renewable
Asia 1 4
2 1
Australia 1 1
Europe 1 1
2 3
3 2
North America 1 1
5 1
South America 5 1
The weird thing is the indexing: if I index a combination whose count is > 0, this gives me the value,
count_groups.loc['Asia', 1]
>> 4
if not,
count_groups.loc['Asia', 3]
>> IndexingError: Too many indexers
shouldn't it give me a 0 as there are no entries in that category? I would assume so as that dataframe was created using the groupby.
If not, can anyone suggest a procedure so I can preserve the 0 nr of countries for a category of % Renewable?
You have a Series with MultiIndex. Normally, we use tuples for indexing with MultiIndexes but pandas can be flexible about that.
In my opinion, count_groups.loc[('Asia', 3)] should raise a KeyError since this pair does not appear in the index but that's for pandas developers to decide I guess.
To return a default value from a Series, we can use get like we do in dictionaries:
count_groups.get(('Asia', 3), 0)
This will return 0 if the key does not exist.
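
To keep the zero counts in the Series itself, one possible approach is to reindex against every continent/bin combination with fill_value=0. This is a sketch on a made-up subset of the data that uses plain integer bins instead of the categorical column produced by cut; the names top, full_index and all_counts are hypothetical.

```python
import pandas as pd

# Hypothetical subset of the data, with plain integer bins.
top = pd.DataFrame({
    'Continent': ['Asia', 'Asia', 'Asia', 'Asia', 'Asia', 'Europe'],
    '% Renewable': [2, 1, 1, 1, 1, 3],
})

count_groups = top.groupby(['Continent', '% Renewable']).size()

# Every (continent, bin) pair, including those with no countries.
full_index = pd.MultiIndex.from_product(
    [count_groups.index.levels[0], range(1, 6)],
    names=count_groups.index.names,
)

# Missing pairs are filled with 0 instead of being absent.
all_counts = count_groups.reindex(full_index, fill_value=0)
print(all_counts.loc[('Asia', 3)])
```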
