I have a dataframe that I would like to 'double' (or triple, or....). I am not trying to concatenate a dataframe with itself, i.e. have one full copy of the df stacked on top of another full copy of the df.
Starting with this:
import pandas as pd
from io import StringIO
from IPython.display import display
A_csv = """country
Afghanistan
Brazil
China"""
with StringIO(A_csv) as fp:
    A = pd.read_csv(fp)
display(A)
result
country
0 Afghanistan
1 Brazil
2 China
I want to get something like this; the index and indentation aren't so important.
country
0 Afghanistan
1 Afghanistan
2 Brazil
3 Brazil
4 China
5 China
Use np.repeat:
df = pd.DataFrame(A.values.repeat(2), columns=A.columns)
df
country
0 Afghanistan
1 Afghanistan
2 Brazil
3 Brazil
4 China
5 China
For dataframes with more than one column, pass axis=0 to repeat so that whole rows are repeated rather than the flattened values:
df = pd.DataFrame(A.values.repeat(2, axis=0), columns=A.columns)
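A pandas-only alternative (my suggestion, not part of the original answer) is to repeat the index labels and select with loc; this sidesteps .values, which can upcast mixed-dtype frames to object:

```python
import pandas as pd

# Rebuild the question's frame
A = pd.DataFrame({'country': ['Afghanistan', 'Brazil', 'China']})

# Repeat each index label twice, then select the rows in that order
df = A.loc[A.index.repeat(2)].reset_index(drop=True)
print(df['country'].tolist())
# ['Afghanistan', 'Afghanistan', 'Brazil', 'Brazil', 'China', 'China']
```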
You can use np.repeat:
import numpy as np
pd.DataFrame(np.repeat(df['country'], 2)).reset_index(drop=True)
country
0 Afghanistan
1 Afghanistan
2 Brazil
3 Brazil
4 China
5 China
Using pd.concat:
pd.concat([df] * 2).sort_index().reset_index(drop=True)
Out[56]:
country
0 Afghanistan
1 Afghanistan
2 Brazil
3 Brazil
4 China
5 China
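Each of these generalizes from doubling to n copies; for example, with the concat approach (n = 3 here to triple each row):

```python
import pandas as pd

A = pd.DataFrame({'country': ['Afghanistan', 'Brazil', 'China']})
n = 3  # number of copies of each row

# Stack n full copies, then interleave them by sorting on the duplicated index
out = pd.concat([A] * n).sort_index().reset_index(drop=True)
print(out['country'].tolist())
```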
Let's say I have this dataframe :
Country Market
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 Spain m2_location
4 USA m1_name
5 USA m2_name
6 USA m3_size
7 USA m3_location
I want to group by the "Country" column and keep only the rows whose market prefix (m1, m2, ...) is the most frequent within its group.
The expected result would be :
Country Market
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
6 USA m3_size
7 USA m3_location
I already tried extracting the prefix, then getting the mode of the prefix on the dataframe and merging rows with this mode, but I feel that a more direct and more efficient solution exists.
Here is a working sample for reproducible results (note the market column is named "City" here):
df = pd.DataFrame({
    "Country": ["Spain", "Spain", "Spain", "Spain", "USA", "USA", "USA", "USA"],
    "City": ["m1_name", "m1_location", "m1_size", "m2_location", "m1_name", "m2_name", "m3_size", "m3_location"]
})
df['prefix'] = df['City'].str.split('_').str[0]  # extract the m1/m2/m3 prefix
modes = df.groupby('Country')['prefix'].agg(pd.Series.mode).rename("modes")
df = df.merge(modes, how="right", left_on=['Country', 'prefix'], right_on=['Country', "modes"])
df = df.drop(['modes', 'prefix'], axis=1)
print(df)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 USA m3_size
4 USA m3_location
You can try groupby and apply to filter the rows within each group:
out = (df.assign(prefix=df['City'].str.split('_').str[0])
         .groupby('Country')
         .apply(lambda g: g[g['prefix'].isin(g['prefix'].mode())])
         .reset_index(drop=True)
         .drop('prefix', axis=1))
print(out)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 USA m3_size
4 USA m3_location
Use:
In [575]: df['Prefix_count'] = df.groupby(['Country', df.City.str.split('_').str[0]])['City'].transform('size')
In [589]: idx = df.groupby('Country')['Prefix_count'].transform('max') == df['Prefix_count']
In [593]: df[idx].drop('Prefix_count', axis=1)
Out[593]:
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
6 USA m3_size
7 USA m3_location
An interesting fact about the proposed solutions is that Mayank's is by far the fastest. I ran them on 1,000 rows of my data and got:
Mayank's solution : 0.020 seconds
Ynjxsjmh's solution : 0.402 seconds
My (OP) solution : 0.122 seconds
I have a dataframe and I'm trying to create a new column of values that is one column divided by the other. This should be obvious but I'm only getting 0's and 1's as my output.
I also tried converting the output to float in case the output was somehow being rounded off but that didn't change anything.
def answer_seven():
    df = answer_one()
    columns_to_keep = ['Self-citations', 'Citations']
    df = df[columns_to_keep]
    df['ratio'] = df['Self-citations'] / df['Citations']
    return df

answer_seven()
Output:
Self_cite Citations ratio
Country
Aus. 15606 90765 0
Brazil 14396 60702 0
Canada 40930 215003 0
China 411683 597237 1
France 28601 130632 0
Germany 27426 140566 0
India 37209 128763 0
Iran 19125 57470 0
Italy 26661 111850 0
Japan 61554 223024 0
S Korea 22595 114675 0
Russian 12422 34266 0
Spain 23964 123336 0
Britain 37874 206091 0
America 265436 792274 0
Does anyone know why I'm only getting 1's and 0's when I want float values? I tried the solutions given in the linked question and none of them worked. I've tried converting the values to floats a few different ways, including .astype('float'), float(df['A']), and df['ratio'] = df['Self-citations'] * 1.0 / df['Citations'], but none have worked so far.
Without the exact dataframe it is difficult to say, but it is most likely a casting problem.
Let's build a MCVE:
import io
import pandas as pd
s = io.StringIO("""Country;Self_cite;Citations
Aus.;15606;90765
Brazil;14396;60702
Canada;40930;215003
China;411683;597237
France;28601;130632
Germany;27426;140566
India;37209;128763
Iran;19125;57470
Italy;26661;111850
Japan;61554;223024
S. Korea;22595;114675
Russian;12422;34266
Spain;23964;123336
Britain;37874;206091
America;265436;792274""")
df = pd.read_csv(s, sep=';', header=0).set_index('Country')
Then we can perform the desired operation as you suggested:
df['ratio'] = df['Self_cite']/df['Citations']
Checking dtypes:
df.dtypes
Self_cite int64
Citations int64
ratio float64
dtype: object
The result is:
Self_cite Citations ratio
Country
Aus. 15606 90765 0.171939
Brazil 14396 60702 0.237159
Canada 40930 215003 0.190369
China 411683 597237 0.689313
France 28601 130632 0.218943
Germany 27426 140566 0.195111
India 37209 128763 0.288973
Iran 19125 57470 0.332782
Italy 26661 111850 0.238364
Japan 61554 223024 0.275997
S. Korea 22595 114675 0.197035
Russian 12422 34266 0.362517
Spain 23964 123336 0.194299
Britain 37874 206091 0.183773
America 265436 792274 0.335031
Graphically:
df['ratio'].plot(kind='bar')
If you want to enforce type, you can cast dataframe using astype method:
df.astype(float)
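If the division still truncates in your environment (e.g. integer-division semantics on Python 2), explicitly casting one operand to float before dividing guarantees a float result; a minimal sketch with the first two rows:

```python
import io
import pandas as pd

s = io.StringIO("""Country;Self_cite;Citations
Aus.;15606;90765
Brazil;14396;60702""")
df = pd.read_csv(s, sep=';', header=0).set_index('Country')

# Casting the numerator to float forces true division regardless of dtypes
df['ratio'] = df['Self_cite'].astype(float) / df['Citations']
print(df['ratio'].round(6))
```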
I have a dataframe like this one,
Continent % Renewable
Country
China Asia 2
United States North America 1
Japan Asia 1
United Kingdom Europe 1
Russian Federation Europe 2
Canada North America 5
Germany Europe 2
India Asia 1
France Europe 2
South Korea Asia 1
Italy Europe 3
Spain Europe 3
Iran Asia 1
Australia Australia 1
Brazil South America 5
where the % Renewable is a column created using the cut function,
Top15['% Renewable'] = pd.cut(Top15['% Renewable'], 5, labels=range(1,6))
when I group by Continent and % Renewable to count the number of countries in each subset I do,
count_groups = Top15.groupby(['Continent', '% Renewable']).size()
which is,
Continent % Renewable
Asia 1 4
2 1
Australia 1 1
Europe 1 1
2 3
3 2
North America 1 1
5 1
South America 5 1
The weird thing is the indexing now: if I look up a pair whose count is > 0, it gives me the value,
count_groups.loc['Asia', 1]
>> 4
if not,
count_groups.loc['Asia', 3]
>> IndexingError: Too many indexers
shouldn't it give me a 0 as there are no entries in that category? I would assume so as that dataframe was created using the groupby.
If not, can anyone suggest a procedure so I can preserve the 0 nr of countries for a category of % Renewable?
You have a Series with MultiIndex. Normally, we use tuples for indexing with MultiIndexes but pandas can be flexible about that.
In my opinion, count_groups.loc[('Asia', 3)] should raise a KeyError since this pair does not appear in the index but that's for pandas developers to decide I guess.
To return a default value from a Series, we can use get like we do in dictionaries:
count_groups.get(('Asia', 3), 0)
This will return 0 if the key does not exist.
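To instead materialize an explicit 0 for every missing (Continent, % Renewable) pair, one option (my suggestion, not part of the original answer) is to unstack the counts with a fill value and stack them back:

```python
import pandas as pd

# A small frame shaped like the one in the question
df = pd.DataFrame({
    'Continent': ['Asia', 'Asia', 'Asia', 'Europe'],
    '% Renewable': [1, 1, 2, 2],
})
counts = df.groupby(['Continent', '% Renewable']).size()

# unstack -> Continent x category table with 0 for absent pairs; stack -> Series
full = counts.unstack(fill_value=0).stack()
print(full.loc[('Europe', 1)])  # 0
```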
Following this post I tried this to delete two columns from a dataframe:
import pandas as pd
from io import StringIO
A_csv = """cases,population,country,year,type,count
745,19987071,Afghanistan,1999,population,19987071
2666,20595360,Afghanistan,2000,population,20595360
37737,172006362,Brazil,1999,population,172006362
80488,174504898,Brazil,2000,population,174504898
212258,1272915272,China,1999,population,1272915272
213766,1280428583,China,2000,population,1280428583"""
with StringIO(A_csv) as fp:
    A = pd.read_csv(fp)
print(A)
print()
dropcols = ["type", "count"]
A = A.drop(dropcols, axis = 1, inplace = True)
print(A)
result
cases population country year type count
0 745 19987071 Afghanistan 1999 population 19987071
1 2666 20595360 Afghanistan 2000 population 20595360
2 37737 172006362 Brazil 1999 population 172006362
3 80488 174504898 Brazil 2000 population 174504898
4 212258 1272915272 China 1999 population 1272915272
5 213766 1280428583 China 2000 population 1280428583
None
Is there something obvious that is escaping me?
These solutions were mentioned in the comments. I'm just fleshing them out in this post.
When using drop, be wary of the two options you have.
One of them is to drop inplace. When this is done, the dataframe is operated upon and changes are made to the original. This means the following is sufficient.
A.drop(dropcols, axis=1, inplace=True)
A
cases population country year
0 745 19987071 Afghanistan 1999
1 2666 20595360 Afghanistan 2000
2 37737 172006362 Brazil 1999
3 80488 174504898 Brazil 2000
4 212258 1272915272 China 1999
5 213766 1280428583 China 2000
As the df.drop documentation specifies:
inplace : bool, default False
If True, do operation inplace and return None.
Note that when drop is called with inplace=True, it returns None (the default return value of any Python function without an explicit return), and A will already have been updated.
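A minimal sketch of the pitfall, using a hypothetical two-column frame:

```python
import pandas as pd

A = pd.DataFrame({'keep': [1, 2], 'drop_me': [3, 4]})

# inplace=True mutates A and returns None, so never assign the result back
ret = A.drop(['drop_me'], axis=1, inplace=True)
print(ret)              # None
print(list(A.columns))  # ['keep']
```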
The other option is to drop, but return a copy. This means that the original is not modified. So, you can now do:
B = A.drop(dropcols, axis=1)
B
cases population country year
0 745 19987071 Afghanistan 1999
1 2666 20595360 Afghanistan 2000
2 37737 172006362 Brazil 1999
3 80488 174504898 Brazil 2000
4 212258 1272915272 China 1999
5 213766 1280428583 China 2000
A
cases population country year type count
0 745 19987071 Afghanistan 1999 population 19987071
1 2666 20595360 Afghanistan 2000 population 20595360
2 37737 172006362 Brazil 1999 population 172006362
3 80488 174504898 Brazil 2000 population 174504898
4 212258 1272915272 China 1999 population 1272915272
5 213766 1280428583 China 2000 population 1280428583
Where B and A exist separately.
Note that you are not saving any memory by working with inplace - both methods create a copy. However, in the former case, the copy is made behind the scenes and the changes are assigned back into the original object.
I need to import web-based data (as posted below) into Python. I used urllib2.urlopen (data available here). However, the data was imported as string lines. How can I convert them into a pandas DataFrame while stripping away the double-quotes "? Thank you for your help.
"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
"Israel","ISR","2000","6114.57","4.07733","129253.89423","64.436450847","10.266688415"
"Malawi","MWI","2000","11801.505","59.543808333","5026.2217836","74.707624181","11.658954494"
"South Africa","ZAF","2000","45064.098","6.93983","227242.36949","72.718710427","5.7265463933"
"United States","USA","2000","282171.957","1","9898700","72.347054303","6.0324539789"
"Uruguay","URY","2000","3219.793","12.099591667","25255.961693","78.978740282","5.108067988"
You can do:
>>> import pandas as pd
>>> df=pd.read_csv('https://raw.githubusercontent.com/QuantEcon/QuantEcon.py/master/data/test_pwt.csv')
>>> df
country country isocode year POP XRAT \
0 Argentina ARG 2000 37335.653 0.999500
1 Australia AUS 2000 19053.186 1.724830
2 India IND 2000 1006300.297 44.941600
3 Israel ISR 2000 6114.570 4.077330
4 Malawi MWI 2000 11801.505 59.543808
5 South Africa ZAF 2000 45064.098 6.939830
6 United States USA 2000 282171.957 1.000000
7 Uruguay URY 2000 3219.793 12.099592
tcgdp cc cg
0 295072.218690 75.716805 5.578804
1 541804.652100 67.759026 6.720098
2 1728144.374800 64.575551 14.072206
3 129253.894230 64.436451 10.266688
4 5026.221784 74.707624 11.658954
5 227242.369490 72.718710 5.726546
6 9898700.000000 72.347054 6.032454
7 25255.961693 78.978740 5.108068
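If you have already fetched the text with urllib, you can wrap the string in a StringIO and let read_csv do the quote handling; a sketch with the first two data rows:

```python
import io
import pandas as pd

raw = '''"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"'''

# read_csv strips the double quotes and infers numeric dtypes automatically
df = pd.read_csv(io.StringIO(raw))
print(df[['country', 'year', 'POP']])
```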