Delete value Pandas Column - python

I started to learn about pandas and try to analyze a data
So in my data there is a column country which contain a few country,I only want to take the first value and change it to a new column.
An example First index have Colombia,Mexico,United Stated and I only wanna to take the first one Colombia [0] and delete the other contry[1:x],is this possible?
I try a few like loc,iloc or drop() but I hit a dead end so I asked in here

You can use Series.str.split:
df['country'] = df['country'].str.split(',').str[0]
Consider below df for example:
In [1520]: df = pd.DataFrame({'country':['Colombia, Mexico, US', 'Croatia, Slovenia, Serbia', 'Denmark', 'Denmark, Brazil']})
In [1521]: df
Out[1521]:
country
0 Colombia, Mexico, US
1 Croatia, Slovenia, Serbia
2 Denmark
3 Denmark, Brazil
In [1523]: df['country'] = df['country'].str.split(',').str[0]
In [1524]: df
Out[1524]:
country
0 Colombia
1 Croatia
2 Denmark
3 Denmark

Use .str.split():
df['country'] = df['country'].str.split(',',expand=True)[0]

Related

Sorting Data in Pandas

I have the dataset shown below. I am trying to sort it so that the columns are in this order: Week End, Australia, Germany, France, etc...
I have tried using loc and assigning each of the data sets as variables but when I create a new DataFrame it causes an error. Any help would be appreciated.
This is the data before any changes:
Region
Week End
Value
Australia
2014-01-11
1.480510
Germany
2014-01-11
1.481258
France
2014-01-11
0.986507
United Kingdom
2014-01-11
1.973014
Italy
2014-01-11
0.740629
This is my desired output:
Week End
Australia
Germany
France
United Kingdom
Italy
2014-01-11
1.480510
1.481258
0.986507
1.973014
0.740629
What I've tried:
cols = (['Region','Week End','Value'])
df = GS.loc[GS['Brand'].isin(rows)]
df = df[cols]
AUS = df.loc[df['Region'] == 'Australia']
JPN = df.loc[df['Region'] == 'Japan']
US = df.loc[df['Region'] == 'United States of America']
I think that you could actually just do:
df.pivot(index="Week End", columns="Region", values="Value")
User 965311532's answer is much more concise, but an alternative approach using dictionaries would be:
new_df = {'Week End': df['Week End'][0]}
new_df.update({region: value for region, value in zip(df['Region'], df['Value'])})
new_df = pd.DataFrame(new_df, index = [0])
As user 965311532 pointed out, the above code will not work if there are more dates. In this case, we could use pandas groupby:
dates = []
for date, group in df.groupby('Week End'):
date_df = {'Week End': date}
date_df.update({region: value for region, value in zip(df['Region'], df['Value'])})
date_df = pd.DataFrame(date_df, index = [0])
dates.append(date_df)
new_df = pd.concat(dates)

How to iterate over pandas dataframe rows, find string and separate into colums?

So here is my issue, I have a dataframe df with a column "Info" like this:
0 US[edit]
1 Boston(B1)
2 Washington(W1)
3 Chicago(C1)
4 UK[edit]
5 London(L2)
6 Manchester(L2)
I would like to put all the strings containing "[ed]" into a separate column df['state'], the remaining strings should be put into another column df['city']. I wanna do some clean up too and remove things in [] and (). This is what I tried:
for ind, row in df.iterrows():
if df['Info'].str.contains('[ed', regex=False):
df['state']=df['info'].str.split('\[|\(').str[0]
else:
df['city']=df['info'].str.split('\[|\(').str[0]
At the end I would like to have something like this
US Boston
US Washington
US Chicago
UK London
UK Manchester
When I try this I always get "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()"
Any help? Thanks!!
Use Series.where with forward filling missing values for state column, for city assign Series s and then
filter by boolean indexing with inverted mask by ~:
m = df['Info'].str.contains('[ed', regex=False)
s = df['Info'].str.split('\[|\(').str[0]
df['state'] = s.where(m).ffill()
df['city'] = s
df = df[~m]
print (df)
Info state city
1 Boston(B1) US Boston
2 Washington(W1) US Washington
3 Chicago(C1) US Chicago
5 London(L2) UK London
6 Manchester(L2) UK Manchester
If you want you can also remove original column by adding DataFrame.pop:
m = df['Info'].str.contains('[ed', regex=False)
s = df.pop('Info').str.split('\[|\(').str[0]
df['state'] = s.where(m).ffill()
df['city'] = s
df = df[~m]
print (df)
state city
1 US Boston
2 US Washington
3 US Chicago
5 UK London
6 UK Manchester
I would do:
s = df.Info.str.extract('([\w\s]+)(\[edit\])?')
df['city'] = s[0]
df['state'] = s[0].mask(s[1].isna()).ffill()
df = df[s[1].isna()]
Output:
Info city state
1 1 Boston(B1) Boston US
2 2 Washington(W1) Washington US
3 3 Chicago(C1) Chicago US
5 5 London(L2) London UK
6 6 Manchester(L2) Manchester UK

Problem with New Column in Pandas Dataframe

I have a dataframe and I'm trying to create a new column of values that is one column divided by the other. This should be obvious but I'm only getting 0's and 1's as my output.
I also tried converting the output to float in case the output was somehow being rounded off but that didn't change anything.
def answer_seven():
df = answer_one()
columns_to_keep = ['Self-citations', 'Citations']
df = df[columns_to_keep]
df['ratio'] = df['Self-citations'] / df['Citations']
return df
answer_seven()
Output:
Self_cite Citations ratio
Country
Aus. 15606 90765 0
Brazil 14396 60702 0
Canada 40930 215003 0
China 411683 597237 1
France 28601 130632 0
Germany 27426 140566 0
India 37209 128763 0
Iran 19125 57470 0
Italy 26661 111850 0
Japan 61554 223024 0
S Korea 22595 114675 0
Russian 12422 34266 0
Spain 23964 123336 0
Britain 37874 206091 0
America 265436 792274 0
Does anyone know why I'm only getting 1's and 0's when I want float values? I tried the solutions given in the link suggested and none of them worked. I've tried to convert the values to floats using a few different methods including .astype('float'), float(df['A']) and df['ratio'] = df['Self-citations'] * 1.0 / df['Citations']. But none have worked so far.
Without having the exact dataframe it is difficult to say. But it is most likely a casting problem.
Lets build a MCVE:
import io
import pandas as pd
s = io.StringIO("""Country;Self_cite;Citations
Aus.;15606;90765
Brazil;14396;60702
Canada;40930;215003
China;411683;597237
France;28601;130632
Germany;27426;140566
India;37209;128763
Iran;19125;57470
Italy;26661;111850
Japan;61554;223024
S. Korea;22595;114675
Russian;12422;34266
Spain;23964;123336
Britain;37874;206091
America;265436;792274""")
df = pd.read_csv(s, sep=';', header=0).set_index('Country')
Then we can perform the desired operation as you suggested:
df['ratio'] = df['Self_cite']/df['Citations']
Checking dtypes:
df.dtypes
Self_cite int64
Citations int64
ratio float64
dtype: object
The result is:
Self_cite Citations ratio
Country
Aus. 15606 90765 0.171939
Brazil 14396 60702 0.237159
Canada 40930 215003 0.190369
China 411683 597237 0.689313
France 28601 130632 0.218943
Germany 27426 140566 0.195111
India 37209 128763 0.288973
Iran 19125 57470 0.332782
Italy 26661 111850 0.238364
Japan 61554 223024 0.275997
S. Korea 22595 114675 0.197035
Russian 12422 34266 0.362517
Spain 23964 123336 0.194299
Britain 37874 206091 0.183773
America 265436 792274 0.335031
Graphically:
df['ratio'].plot(kind='bar')
If you want to enforce type, you can cast dataframe using astype method:
df.astype(float)

Create a new dataframe column by comparing two other columns in different dataframes [duplicate]

This question already has answers here:
Mapping columns from one dataframe to another to create a new column [duplicate]
(2 answers)
Closed 4 years ago.
I have a DataFrame which contains Alpha 2 country codes (UK, ES, SL etc) and I need these to be the country names. I created a second data frame that has all the Alpha 2 country codes in one column and the corresponding names in another.
I'm trying to compare these two columns then using the index to create the new column. However I am struggling to do this without using a loop. I feel like there is a more efficient way to do this without looping?
I have tried using a for loop, iterating over:
cube_data = pd.DataFrame({'Country Code':['UK','ES','SL']})
alpha2 = pd.DataFrame({'Code':['ES','GH','UK','SL'],
'Name':['Spain','Ghana','United Kingdom','Sierra Leone']})
cube_data
Country Code
0 UK
1 ES
2 SL
alpha2
Code Name
0 ES Spain
1 GH Ghana
2 UK United Kingdom
3 SL Sierra Leone
I have used a for loop to iterate through the columns and when the code from cube_data is found in alpha2['Code'] the index is used to create a new series which has alpha['Name'] at the correct position corresponding to the cube_data.
end result is:
cube_data
Country Code Name
0 UK United Kingdom
1 ES Spain
2 SL Sierra Leone
Surely there is a better way to do this without looping? I have had a look at series.isin() and series.map() but these do not seem to provide the result I need.
Can this be done without a loop?
You can use pandas merge:
df = alpha2.merge(cube_data, left_on='Code', right_on='Country Code', how='inner').drop('Code', axis=1)
merge works like an SQL join: here we merge alpha2 with cube_data. We use the columns 'Code' from alpha2 and 'Country Code' from cube_data to merge the two datframes together and use an 'inner' join logic meaning that only values present in both dataframes will be kept. Finally we drop the column 'Code' from alpha2 which contains the same values as the column 'Country Code'
Use map after converting alpha2 to a mappable object.
First we make our map:
>> country_map = alpha2.set_index('Code')['Name'].to_dict()
>> # country_map = dict(alpha2[['Code', 'Name']].values)
>> # country_map = alpha2.set_index('Code')['Name']
>> print(country_map)
{'ES': 'Spain', 'UK': 'United Kingdom', 'GH': 'Ghana', 'SL': 'Sierra Leone'}
Then we map it on the Country Code column:
>> cube_data['Country'] = cube_data['Country Code'].map(country_map)
>> print(cube_data)
Country Code Country
0 UK United Kingdom
1 ES Spain
2 SL Sierra Leone
Have you looked into the pycountry module?
I've changed your 'UK' alpha_2 to 'GB'.
import pandas as pd
import pycountry
cube_data = pd.DataFrame({'Country Code':['GB','ES','SL']})
for alpha2_code in cube_data['Country Code']:
c = pycountry.countries.get(alpha_2=alpha2_code)
print(c.name)
output:
United Kingdom
Spain
Sierra Leone
Using a lambda to create new column
df = cube_data
df['Name'] = df['Country Code'].apply(lambda x: pycountry.countries.get(alpha_2=x).name)
print(df)
output:
Country Code name
0 GB United Kingdom
1 ES Spain
2 SL Sierra Leone

How do I use a mapping variable to re-index a dataframe?

I have the following data frame:
population GDP
country
United Kingdom 4.5m 10m
Spain 3m 8m
France 2m 6m
I also have the following information in a 2 column dataframe(happy for this to be made into another datastruct if that will be more beneficial as the plan is that it will be sorted in a VARS file.
county code
Spain es
France fr
United Kingdom uk
The 'mapping' datastruct will be sorted in a random order as countries will be added/removed at random times.
What is the best way to re-index the data frame to its country code from its country name?
Is there a smart solution that would also work on other columns so for example if a data frame was indexed on date but one column was df['county'] then you could change df['country'] to its country code? Finally is there a third option that would add an additional column that was either country/code which selected the right code based on a country name in another column?
I think you can use Series.map, but it works only with Series, so need Index.to_series. Last rename_axis (new in pandas 0.18.0):
df1.index = df1.index.to_series().map(df2.set_index('county').code)
df1 = df1.rename_axis('county')
#pandas bellow 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
It is same as mapping by dict:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.to_series().map(d)
df1 = df1.rename_axis('county')
#pandas bellow 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
EDIT:
Another solution with Index.map, so to_series is omitted:
d = df2.set_index('county').code.to_dict()
print (d)
{'France': 'fr', 'Spain': 'es', 'United Kingdom': 'uk'}
df1.index = df1.index.map(d.get)
df1 = df1.rename_axis('county')
#pandas bellow 0.18.0
#df1.index.name = 'county'
print (df1)
population GDP
county
uk 4.5m 10m
es 3m 8m
fr 2m 6m
Here are some brief ways to approach your 3 questions. More details below:
1) How to change index based on mapping in separate df
Use df_with_mapping.todict("split") to create a dictionary, then use a list comprehension to change it into {"old1":"new1",...,"oldn":"newn"} form then use df.index = df.base_column.map(dictionary) to get the changed index.
2) How to change index if the new column is in the same df:
df.index = df["column_you_want"]
3) Creating a new column by mapping on a old column:
df["new_column"] = df["old_column"].map({"old1":"new1",...,"oldn":"newn"})
1) Mapping for the current index exists in separate dataframe but you don't have the mapped column in the dataframe yet
This is essentially the same as question 2 with the additional step of creating a dictionary for the mapping you want.
#creating the mapping dictionary in the form of current index : future index
df2 = pd.DataFrame([["es"],["fr"]],index = ["spain","france"])
interm_dict = df2.to_dict("split") #Creates a dictionary split into column labels, data labels and data
mapping_dict = {country:data[0] for country,data in zip(interm_dict["index"],interm_dict['data'])}
#We only want the first column of the data and the index so we need to make a new dict with a list comprehension and zip
df["country"] = df.index #Create a new column if u want to save the index
df.index = pd.Series(df.index).map(mapping_dict) #change the index
df.index.name = "" #Blanks out index name
df = df.drop("county code",1) #Drops the county code column to avoid duplicate columns
Before:
county code language
spain es spanish
france fr french
After:
language country
es spanish spain
fr french france
2) Changing the current index to one of the columns already in the dataframe
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "french"])
df["country"] = df.index #if you want to save the original index
df.index = df["county code"] #The only step you actually need
df.index.name = "" #if you want a blank index name
df = df.drop("county code",1) #if you dont want the duplicate column
Before:
county code language
spain es spanish
french fr french
After:
language country
es spanish spain
fr french french
3) Creating an additional column based on another column
This is again essentially the same as step 2 except we create an additional column instead of assigning .index to the created series.
df = pd.DataFrame([["es","spanish"],["fr","french"]], columns = ["county code","language"], index = ["spain", "france"])
df["city"] = df["county code"].map({"es":"barcelona","fr":"paris"})
Before:
county code language
spain es spanish
france fr french
After:
county code language city
spain es spanish barcelona
france fr french paris

Categories

Resources