Pandas dataframe: split one column's data into 2 using some condition - python

I have a dataframe like the one below:
                  0
____________________________________
0     Country| India
60    Delhi
62    Mumbai
68    Chennai
75    Country| Italy
78    Rome
80    Venice
85    Milan
88    Country| Australia
100   Sydney
103   Melbourne
107   Perth
I want to split the data into 2 columns so that one column holds the country and the other holds the city. I have no idea where to start. I want output like the below:
   0                   1
____________________________________
0  Country| India      Delhi
1  Country| India      Mumbai
2  Country| India      Chennai
3  Country| Italy      Rome
4  Country| Italy      Venice
5  Country| Italy      Milan
6  Country| Australia  Sydney
7  Country| Australia  Melbourne
8  Country| Australia  Perth
Any idea how to do this?
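For reference, here is a minimal sketch that reconstructs the sample dataframe (the integer column label 0 and the irregular index values are assumptions read off the printout above):

import pandas as pd

# hypothetical reconstruction of the sample frame; the single
# column is assumed to be labeled 0 and the index is irregular
df = pd.DataFrame(
    {0: ["Country| India", "Delhi", "Mumbai", "Chennai",
         "Country| Italy", "Rome", "Venice", "Milan",
         "Country| Australia", "Sydney", "Melbourne", "Perth"]},
    index=[0, 60, 62, 68, 75, 78, 80, 85, 88, 100, 103, 107],
)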

Look for rows where | is present, pull them into another column, and fill down on the newly created column:
(
    df.rename(columns={0: "city"})
      # look for rows that contain '|' and put them into a new
      # column called Country; rows that do not match will be
      # null in the new column
      .assign(Country=lambda x: x.loc[x.city.str.contains(r"\|"), "city"])
      # fill down on the Country column, which also links each
      # country with its cities
      .ffill()
      # drop the rows where city equals Country, so that only
      # country entries remain in Country and only cities in city
      .query("city != Country")
      # reverse the column order to match the expected output
      .iloc[:, ::-1]
)
                Country       city
60       Country| India      Delhi
62       Country| India     Mumbai
68       Country| India    Chennai
78       Country| Italy       Rome
80       Country| Italy     Venice
85       Country| Italy      Milan
100  Country| Australia     Sydney
103  Country| Australia  Melbourne
107  Country| Australia      Perth
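A small variant, not part of the original answer: str.contains also accepts regex=False, in which case the pipe needs no escaping, so the mask inside the chain could equally be written as:

.assign(Country=lambda x: x.loc[x.city.str.contains("|", regex=False), "city"])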

Use DataFrame.insert with Series.where and Series.str.startswith to replace non-matching values with missing values, ffill to forward-fill them, and then remove the rows whose value equals the new country column, using Series.ne (not equal) in boolean indexing:
df.insert(0, 'country', df[0].where(df[0].str.startswith('Country')).ffill())
df = df[df['country'].ne(df[0])].reset_index(drop=True).rename(columns={0:'city'})
print(df)
              country       city
0      Country| India      Delhi
1      Country| India     Mumbai
2      Country| India    Chennai
3      Country| Italy       Rome
4      Country| Italy     Venice
5      Country| Italy      Milan
6  Country| Australia     Sydney
7  Country| Australia  Melbourne
8  Country| Australia      Perth

Subtract value of column based on another column

I have a big dataframe (the following is an example)
country     value
portugal    86
germany     20
belgium     21
Uk          81
portugal    77
UK          87
I want to subtract 60 from the value whenever the country is portugal or UK; the dataframe should then look like this:
country     value
portugal    26
germany     20
belgium     21
Uk          21
portugal    17
UK          27
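A minimal sketch reconstructing the sample dataframe (column names taken from the tables above):

import pandas as pd

df = pd.DataFrame({
    'country': ['portugal', 'germany', 'belgium', 'Uk', 'portugal', 'UK'],
    'value': [86, 20, 21, 81, 77, 87],
})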
IIUC, use isin on the lowercased country string to check whether each value is in a reference list, then slice the dataframe with loc for in-place modification:
df.loc[df['country'].str.lower().isin(['portugal', 'uk']), 'value'] -= 60
output:
country value
0 portugal 26
1 germany 20
2 belgium 21
3 Uk 21
4 portugal 17
5 UK 27
Use numpy.where:
In [1621]: import numpy as np
In [1622]: df['value'] = np.where(df['country'].str.lower().isin(['portugal', 'uk']), df['value'] - 60, df['value'])
In [1623]: df
Out[1623]:
country value
0 portugal 26
1 germany 20
2 belgium 21
3 Uk 21
4 portugal 17
5 UK 27

pandas update specific rows in specific columns in one dataframe based on another dataframe

I have two dataframes, Big and Small, and I want to update Big based on the data in Small, only in specific columns.
this is Big:
>>> ID name country city hobby age
0 12 Meli Peru Lima eating 212
1 15 Saya USA new-york drinking 34
2 34 Aitel Jordan Amman riding 51
3 23 Tanya Russia Moscow sports 75
4 44 Gil Spain Madrid paella 743
and this is Small:
>>> ID name country city hobby age
0 12 Melinda Peru Lima eating 24
4 44 Gil Spain Barcelona friends 21
I would like to update the rows in Big based on info from Small, matching on the ID number. I would also like to change only specific columns, the age and the city, and not the name/country/hobby.
so the result table should look like this:
>>> ID name country city hobby age
0 12 Meli Peru Lima eating *24*
1 15 Saya USA new-york drinking 34
2 34 Aitel Jordan Amman riding 51
3 23 Tanya Russia Moscow sports 75
4 44 Gil Spain *Barcelona* paella *21*
I know to use update, but in this case I don't want to change all the columns in each row, only specific ones. Is there a way to do that?
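For reference, a minimal sketch that builds the two frames; the answer below refers to Big as df1 and Small as df2 (variable names assumed, since the question never assigns any):

import pandas as pd

# Big
df1 = pd.DataFrame({
    'ID': [12, 15, 34, 23, 44],
    'name': ['Meli', 'Saya', 'Aitel', 'Tanya', 'Gil'],
    'country': ['Peru', 'USA', 'Jordan', 'Russia', 'Spain'],
    'city': ['Lima', 'new-york', 'Amman', 'Moscow', 'Madrid'],
    'hobby': ['eating', 'drinking', 'riding', 'sports', 'paella'],
    'age': [212, 34, 51, 75, 743],
})

# Small
df2 = pd.DataFrame({
    'ID': [12, 44],
    'name': ['Melinda', 'Gil'],
    'country': ['Peru', 'Spain'],
    'city': ['Lima', 'Barcelona'],
    'hobby': ['eating', 'friends'],
    'age': [24, 21],
})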
Use DataFrame.update after converting ID to the index and selecting only the columns to process, here age and city:
df11 = df1.set_index('ID')
df22 = df2.set_index('ID')[['age','city']]
df11.update(df22)
df = df11.reset_index()
print(df)
ID name country city hobby age
0 12 Meli Peru Lima eating 24.0
1 15 Saya USA new-york drinking 34.0
2 34 Aitel Jordan Amman riding 51.0
3 23 Tanya Russia Moscow sports 75.0
4 44 Gil Spain Barcelona paella 21.0
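Note that update works through NaN alignment, so the updated columns are upcast to float (age prints as 24.0 above). If the integer dtype matters, cast it back afterwards:

df['age'] = df['age'].astype(int)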

Compare two data frames (Source vs Target) and leave an empty row if a record is not found in the Target table (having the same index number as Source)

I want to compare the data in dataframe "Source", by its 'Index' number, against dataframe "Target"; if a searched index is not found in the Target dataframe, a blank row with the same index key as in Source has to be printed in the Target table. Is there any way to achieve this without a loop? I need to compare a dataset of 500,000 records.
Below are the Source, Target, and expected data frames. Source has a record for index number 3, whereas Target does not.
I want to print a blank row with the same index number as in Source.
Source:
Index Employee ID Employee Name Age City Country
1 5678 John 30 New york USA
2 5679 Sam 35 New york USA
3 5680 Johy 25 New york USA
4 5681 Rose 70 New york USA
5 5682 Tom 28 New york USA
6 5683 Nick 49 New york USA
7 5684 Ricky 20 Sydney Australia
Target:
Index Employee ID Employee Name Age City Country
1 5678 John 30 New york USA
2 5679 Sam 35 New york USA
4 5681 Rose 70 New york USA
5 5682 Tom 28 New york USA
6 5683 Nick 49 New york USA
7 5684 Ricky 20 Sydney Australia
Expected:
Index Employee ID Employee Name Age City Country
1 5678 John 30 New york USA
2 5679 Sam 35 New york USA
3
4 5681 Rose 70 New york USA
5 5682 Tom 28 New york USA
6 5683 Nick 49 New york USA
7 5684 Ricky 20 Sydney Australia
Please suggest if there is any way to do it without looping as I need to compare dataset of 500,000 records.
You can reindex and then fillna('') with a blank string:
Target.reindex(Source.index).fillna('')
Or:
Target.reindex(Source.index,fill_value='')
If Index is a column and not actually an index, set it as the index first:
Source = Source.set_index('Index')
Target = Target.set_index('Index')
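Putting the pieces together, an end-to-end sketch (assuming Index starts out as a regular column in both frames):

result = (
    Target.set_index('Index')
          .reindex(Source.set_index('Index').index, fill_value='')
          .reset_index()
)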
Not the best way; I prefer @anky_91's way:
>>> df = pd.concat([source, target]).drop_duplicates(keep='first')
>>> df.loc[~df['Index'].isin(source['Index']) | ~df['Index'].isin(target['Index']), df.columns.drop('Index')] = ''
>>> df
   Index Employee ID Employee Name  Age      City    Country
0      1        5678          John   30  New york        USA
1      2        5679           Sam   35  New york        USA
2      3
3      4        5681          Rose   70  New york        USA
4      5        5682           Tom   28  New york        USA
5      6        5683          Nick   49  New york        USA
6      7        5684         Ricky   20    Sydney  Australia

How to filter a transposed pandas dataframe?

Say I have a transposed df like so
id 0 1 2 3
0 1361 Spain Russia South Africa China
1 1741 Portugal Cuba UK Ukraine
2 1783 Germany USA France Egypt
3 1353 Brazil Russia Japan Kenya
4 1458 India Romania Holland Nigeria
How could I get all rows where there is 'er', so it'll return me this:
id 0 1 2 3
2 1783 Germany USA France Egypt
4 1458 India Romania Holland Nigeria
because 'er' is contained in Germany and Nigeria.
Thanks!
Using str.contains with apply across the columns (this assumes every column holds strings; cast with astype(str) first otherwise):
df[df.apply(lambda x: x.str.contains(pat='er')).any(axis=1)]
Out[96]:
     id        0        1        2        3
2  1783  Germany      USA   France    Egypt
4  1458    India  Romania  Holland  Nigeria
Use apply + str.contains across rows; na=False turns the NaN produced by non-string cells (such as the numeric id) into False:
df = df[df.apply(lambda x: x.str.contains('er', na=False).any(), axis=1)]
print(df)
id 0 1 2 3
2 1783 Germany USA France Egypt
4 1458 India Romania Holland Nigeria
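An apply-free variant, as a sketch (it assumes every cell can be safely stringified): stack the frame into one long Series, run the substring test once, and collapse it back into a per-row mask:

# one mask entry per original row; True if any cell contains 'er'
mask = df.astype(str).stack().str.contains('er').groupby(level=0).any()
print(df[mask])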

python pandas groupby sort rank/top n

I have a dataframe that is grouped by state and aggregated to total revenue, with sector and name ignored. I would now like to break the underlying dataset out to show state, sector, name, and the top 2 by revenue, in a certain order (I have created an index from a previous dataframe that lists states in a certain order). Using the example below, I would like to use my sorted index (Kentucky, California, New York) and list only the top two results per state, ordered by Revenue:
Dataset:
State Sector Name Revenue
California 1 Tom 10
California 2 Harry 20
California 3 Roger 30
California 2 Jim 40
Kentucky 2 Bob 15
Kentucky 1 Roger 25
Kentucky 3 Jill 45
New York 1 Sally 50
New York 3 Harry 15
End Goal Dataframe:
State Sector Name Revenue
Kentucky 3 Jill 45
Kentucky 1 Roger 25
California 2 Jim 40
California 3 Roger 30
New York 1 Sally 50
New York 3 Harry 15
You could use a groupby in conjunction with apply:
res = df.groupby('State').apply(lambda grp: grp.nlargest(2, 'Revenue'))
Output:
Sector Name Revenue
State State
California California 2 Jim 40
California 3 Roger 30
Kentucky Kentucky 3 Jill 45
Kentucky 1 Roger 25
New York New York 1 Sally 50
New York 3 Harry 15
Then you can drop the first level of the resulting MultiIndex to get the result you're after:
res.index = res.index.droplevel()
Output:
Sector Name Revenue
State
California 2 Jim 40
California 3 Roger 30
Kentucky 3 Jill 45
Kentucky 1 Roger 25
New York 1 Sally 50
New York 3 Harry 15
You can sort_values, then use groupby + head:
df.sort_values('Revenue', ascending=False).groupby('State').head(2)
Out[208]:
        State  Sector   Name  Revenue
7    New York       1  Sally       50
6    Kentucky       3   Jill       45
3  California       2    Jim       40
2  California       3  Roger       30
5    Kentucky       1  Roger       25
8    New York       3  Harry       15
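Neither answer imposes the custom state order (Kentucky, California, New York) mentioned in the question. One way to add it, sketched here under the assumption that State is a regular column as in the answer above, is an ordered Categorical:

# keep the top 2 rows per state by revenue
order = ['Kentucky', 'California', 'New York']
top2 = df.sort_values('Revenue', ascending=False).groupby('State').head(2).copy()
# sort states by the custom order, then by revenue within each state
top2['State'] = pd.Categorical(top2['State'], categories=order, ordered=True)
print(top2.sort_values(['State', 'Revenue'], ascending=[True, False]))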
