Converting a dataframe - python

Pollutant  Delhi  London  Paris
PM25          12      36     43
Ozone        120      34     42
NO2          192      35     12
I'm trying to convert a pandas DataFrame like the one above to the one below. Any help would be greatly appreciated! Thank you in advance.
Pollutant  Level    City
PM25          12   Delhi
Ozone        120   Delhi
NO2          192   Delhi
PM25          36  London
Ozone         34  London
NO2           35  London
PM25          43   Paris
Ozone         42   Paris
NO2           12   Paris

You can use pandas.melt() with the right parameters, which does exactly what you need:
>>> df.melt(id_vars=['Pollutant'], value_name='Level', var_name='City')
  Pollutant    City  Level
0      PM25   Delhi     12
1     Ozone   Delhi    120
2       NO2   Delhi    192
3      PM25  London     36
4     Ozone  London     34
5       NO2  London     35
6      PM25   Paris     43
7     Ozone   Paris     42
8       NO2   Paris     12
value_name names the column that holds the values; var_name names the column that holds the former column headers (here, the cities).
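If you also want the columns in the exact Pollutant/Level/City order shown in the question, you can reorder after melting; a minimal sketch reconstructing the example data:

```python
import pandas as pd

# Reconstruction of the example wide dataframe
df = pd.DataFrame({
    "Pollutant": ["PM25", "Ozone", "NO2"],
    "Delhi": [12, 120, 192],
    "London": [36, 34, 35],
    "Paris": [43, 42, 12],
})

# melt, then reorder columns to match the requested layout
out = df.melt(id_vars=["Pollutant"], value_name="Level", var_name="City")
out = out[["Pollutant", "Level", "City"]]
```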

Related

Subtract value of column based on another column

I have a big dataframe (the following is an example)
country   value
portugal     86
germany      20
belgium      21
Uk           81
portugal     77
UK           87
I want to subtract 60 from the value whenever the country is portugal or UK (in any capitalization); the dataframe should then look like this:
country   value
portugal     26
germany      20
belgium      21
Uk           21
portugal     17
UK           27
IIUC, use isin on the lowercased country string to check whether the value is in a reference list, then slice the dataframe with loc for in-place modification:
df.loc[df['country'].str.lower().isin(['portugal', 'uk']), 'value'] -= 60
output:
country value
0 portugal 26
1 germany 20
2 belgium 21
3 Uk 21
4 portugal 17
5 UK 27
Use numpy.where:
In [1621]: import numpy as np
In [1622]: df['value'] = np.where(df['country'].str.lower().isin(['portugal', 'uk']), df['value'] - 60, df['value'])
In [1623]: df
Out[1623]:
country value
0 portugal 26
1 germany 20
2 belgium 21
3 Uk 21
4 portugal 17
5 UK 27
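Another option with the same condition is Series.mask, which keeps values where the condition is False and substitutes the second argument where it is True; a sketch on the example data:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["portugal", "germany", "belgium", "Uk", "portugal", "UK"],
    "value": [86, 20, 21, 81, 77, 87],
})

# Replace matching rows with value - 60, keep the rest unchanged
cond = df["country"].str.lower().isin(["portugal", "uk"])
df["value"] = df["value"].mask(cond, df["value"] - 60)
```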

Pandas Dataframe replace string value based on condition AND using original value

I have a dataframe that looks like this:
YEAR MONTH DAY_OF_MONTH DAY_OF_WEEK ORIGIN_CITY_NAME ORIGIN_STATE_ABR DEST_CITY_NAME DEST_STATE_ABR DEP_TIME DEP_DELAY_NEW ARR_TIME ARR_DELAY_NEW CANCELLED AIR_TIME
0 2020 1 1 3 Ontario CA San Francisco CA 1851 41 2053 68 0 74
1 2020 1 1 3 Ontario CA San Francisco CA 1146 0 1318 0 0 71
2 2020 1 1 3 Ontario CA San Jose CA 2016 0 2124 0 0 57
3 2020 1 1 3 Ontario CA San Jose CA 1350 10 1505 10 0 63
4 2020 1 1 3 Ontario CA San Jose CA 916 1 1023 0 0 57
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
607341 2020 1 16 4 Portland ME New York NY 554 0 846 65 0 57
607342 2020 1 17 5 Portland ME New York NY 633 33 804 23 0 69
607343 2020 1 18 6 Portland ME New York NY 657 0 810 0 0 55
607344 2020 1 19 7 Portland ME New York NY 705 5 921 39 0 54
607345 2020 1 20 1 Portland ME New York NY 628 0 741 0 0 52
I am trying to modify the columns DEP_TIME and ARR_TIME so that they have the format hh:mm. All values should be treated as strings. There are also null values in some rows that need to be accounted for. Performance is also a consideration (albeit secondary to solving the actual problem), since I need to change about 10M records in total.
The challenge for me is figuring out how to modify these values based on a condition while also having access to the original value when replacing it. I could not find a solution for that specific problem elsewhere; most examples replace values with a known constant.
Thanks for your help.
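One possible vectorized approach (a sketch, assuming DEP_TIME stores times as numbers like 1851 meaning 18:51, as in the sample): go through the nullable Int64 and string dtypes so nulls propagate, zero-pad to four digits, then slice into hh:mm.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the flight data, including a null departure time
df = pd.DataFrame({"DEP_TIME": [1851.0, 1146.0, np.nan, 916.0]})

# Nullable Int64 -> string keeps nulls as <NA>; zfill pads 916 to "0916"
s = df["DEP_TIME"].astype("Int64").astype("string").str.zfill(4)
df["DEP_TIME"] = s.str[:2] + ":" + s.str[2:]
```

The same two lines apply to ARR_TIME, and being fully vectorized this should scale to ~10M rows far better than a Python-level loop.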

pandas update specific rows in specific columns in one dataframe based on another dataframe

I have two dataframes, Big and Small, and I want to update Big based on the data in Small, only in specific columns.
this is Big:
>>> ID name country city hobby age
0 12 Meli Peru Lima eating 212
1 15 Saya USA new-york drinking 34
2 34 Aitel Jordan Amman riding 51
3 23 Tanya Russia Moscow sports 75
4 44 Gil Spain Madrid paella 743
and this is small:
>>>ID name country city hobby age
0 12 Melinda Peru Lima eating 24
4 44 Gil Spain Barcelona friends 21
I would like to update the rows in Big based on info from Small, matching on the ID number. I would also like to change only specific columns, the age and the city, and leave the name/country/hobby untouched.
so the result table should look like this:
>>> ID name country city hobby age
0 12 Meli Peru Lima eating *24*
1 15 Saya USA new-york drinking 34
2 34 Aitel Jordan Amman riding 51
3 23 Tanya Russia Moscow sports 75
4 44 Gil Spain *Barcelona* paella *21*
I know to use update, but in this case I don't want to change all the columns in each row, only specific ones. Is there a way to do that?
Use DataFrame.update with ID converted to the index, selecting only the columns to process, here age and city:
df11 = df1.set_index('ID')
df22 = df2.set_index('ID')[['age','city']]
df11.update(df22)
df = df11.reset_index()
print (df)
ID name country city hobby age
0 12 Meli Peru Lima eating 24.0
1 15 Saya USA new-york drinking 34.0
2 34 Aitel Jordan Amman riding 51.0
3 23 Tanya Russia Moscow sports 75.0
4 44 Gil Spain Barcelona paella 21.0
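Note that update upcasts numeric columns to float (hence the 24.0 above); if you need the integer dtype back, cast afterwards. A minimal sketch on a subset of the example data:

```python
import pandas as pd

# Subset of Big and Small from the question
big = pd.DataFrame({
    "ID": [12, 15, 44],
    "name": ["Meli", "Saya", "Gil"],
    "city": ["Lima", "new-york", "Madrid"],
    "age": [212, 34, 743],
})
small = pd.DataFrame({
    "ID": [12, 44],
    "city": ["Lima", "Barcelona"],
    "age": [24, 21],
})

b = big.set_index("ID")
b.update(small.set_index("ID")[["age", "city"]])
b["age"] = b["age"].astype("int64")  # restore the pre-update integer dtype
big = b.reset_index()
```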

How to extract data from a dataframe and join it with another dataframe

I have two dataframes, df and df1. I want to join the two dataframes and get the output in two different ways.
df
City Date Wind Temperature
London 5/11/2019 14 5
London 6/11/2019 28 6
London 7/11/2019 10 5
Berlin 5/11/2019 23 12
Berlin 6/11/2019 24 12
Berlin 7/11/2019 16 16
Munich 5/11/2019 12 10
Munich 6/11/2019 33 11
Munich 7/11/2019 44 13
Paris 5/11/2019 27 6
Paris 6/11/2019 16 7
Paris 7/11/2019 14 8
Paris 8/11/2019 10 6
df1
ID City Delivery_Date Provider
1456223 London 7/11/2019 Amazon
1456345 London 6/11/2019 Amazon
2345623 Paris 8/11/2019 Walmart
1287456 Paris 7/11/2019 Amazon
4568971 Munich 7/11/2019 Amazon
3456789 Berlin 6/11/2019 Walmart
Output1
ID City Delivery_Date Wind Temperature
1456223 London 7/11/2019 10 5
1456345 London 6/11/2019 28 6
2345623 Paris 8/11/2019 10 6
1287456 Paris 7/11/2019 14 8
4568971 Munich 7/11/2019 44 13
Output 2
Here the weather details for each item should be displayed up to and including its delivery date:
ID City Delivery_Date Wind Temperature
1456223 London 5/11/2019 14 5
1456223 London 6/11/2019 28 6
1456223 London 7/11/2019 10 5
1287456 Paris 5/11/2019 27 6
1287456 Paris 6/11/2019 16 7
1287456 Paris 7/11/2019 14 8
How can this be done?
Considering df and df1 as the dataframes you described:
import pandas as pd
# Output1: match each delivery to the weather on its delivery date
output1 = pd.merge(df1, df, left_on=['City', 'Delivery_Date'], right_on=['City', 'Date'], how='inner')
# Output2: keep weather rows up to the latest delivery date per city
res1 = df1.groupby('City').max()[['Delivery_Date']]
result1 = pd.merge(df, res1, on='City')
output2 = result1[result1['Date'] <= result1['Delivery_Date']]
You can use df.merge
import pandas as pd
df.merge(df1[['City','Delivery_Date','ID']],left_on = ['City','Date'] ,right_on = ['City','Delivery_Date'],how='inner')
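Both snippets compare the dates as strings, which happens to work for these samples but breaks across months (e.g. '10/11/2019' < '8/11/2019' lexicographically). A more robust sketch, on a hypothetical subset of the data, parses the dates first:

```python
import pandas as pd

# Hypothetical subset of df and df1
df = pd.DataFrame({
    "City": ["London", "London", "London", "Paris"],
    "Date": ["5/11/2019", "6/11/2019", "7/11/2019", "8/11/2019"],
    "Wind": [14, 28, 10, 10],
    "Temperature": [5, 6, 5, 6],
})
df1 = pd.DataFrame({
    "ID": [1456223, 2345623],
    "City": ["London", "Paris"],
    "Delivery_Date": ["7/11/2019", "8/11/2019"],
})

# Parse dd/mm/yyyy so comparisons are chronological, not lexicographic
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df1["Delivery_Date"] = pd.to_datetime(df1["Delivery_Date"], format="%d/%m/%Y")

# One row per (ID, weather day) up to and including the delivery date
merged = df1.merge(df, on="City")
output2 = merged[merged["Date"] <= merged["Delivery_Date"]]
```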

python pandas groupby sort rank/top n

I have a dataframe that is grouped by state and aggregated to total revenue, ignoring sector and name. I would now like to break the underlying dataset out to show state, sector, name, and the top 2 by revenue, in a certain order (I have created an index from a previous dataframe that lists states in a certain order). Using the example below, I would like to use my sorted index (Kentucky, California, New York) and list only the top two results per state, ordered by Revenue:
Dataset:
State Sector Name Revenue
California 1 Tom 10
California 2 Harry 20
California 3 Roger 30
California 2 Jim 40
Kentucky 2 Bob 15
Kentucky 1 Roger 25
Kentucky 3 Jill 45
New York 1 Sally 50
New York 3 Harry 15
End Goal Dataframe:
State Sector Name Revenue
Kentucky 3 Jill 45
Kentucky 1 Roger 25
California 2 Jim 40
California 3 Roger 30
New York 1 Sally 50
New York 3 Harry 15
You could use a groupby in conjunction with apply (this assumes State is the index of df):
res = df.groupby('State').apply(lambda grp: grp.nlargest(2, 'Revenue'))
Output:
                       Sector   Name  Revenue
State      State
California California       2    Jim       40
           California       3  Roger       30
Kentucky   Kentucky         3   Jill       45
           Kentucky         1  Roger       25
New York   New York         1  Sally       50
           New York         3  Harry       15
Then you can drop the first level of the MultiIndex to get the result you're after:
res.index = res.index.droplevel()
Output:
            Sector   Name  Revenue
State
California       2    Jim       40
California       3  Roger       30
Kentucky         3   Jill       45
Kentucky         1  Roger       25
New York         1  Sally       50
New York         3  Harry       15
You can sort_values, then use groupby + head:
df.sort_values('Revenue', ascending=False).groupby('State').head(2)
Out[208]:
        State  Sector   Name  Revenue
7    New York       1  Sally       50
6    Kentucky       3   Jill       45
3  California       2    Jim       40
2  California       3  Roger       30
5    Kentucky       1  Roger       25
8    New York       3  Harry       15
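Neither answer applies the custom state order (Kentucky, California, New York) mentioned in the question. One way to add it (a sketch, assuming the order list comes from your previously sorted index) is an ordered Categorical:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["California", "California", "California", "California",
              "Kentucky", "Kentucky", "Kentucky", "New York", "New York"],
    "Sector": [1, 2, 3, 2, 2, 1, 3, 1, 3],
    "Name": ["Tom", "Harry", "Roger", "Jim", "Bob", "Roger", "Jill", "Sally", "Harry"],
    "Revenue": [10, 20, 30, 40, 15, 25, 45, 50, 15],
})

order = ["Kentucky", "California", "New York"]  # assumed custom state order

# Top 2 per state by revenue, then sort states by the ordered categorical
top2 = df.sort_values("Revenue", ascending=False).groupby("State").head(2).copy()
top2["State"] = pd.Categorical(top2["State"], categories=order, ordered=True)
result = top2.sort_values(["State", "Revenue"], ascending=[True, False])
```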
