How extract data from dataframe and join with an another dataframe - python

I have two dataframes df and df1. I want to join the both dataframes and get the output in different ways
df
City Date Wind Temperature
London 5/11/2019 14 5
London 6/11/2019 28 6
London 7/11/2019 10 5
Berlin 5/11/2019 23 12
Berlin 6/11/2019 24 12
Berlin 7/11/2019 16 16
Munich 5/11/2019 12 10
Munich 6/11/2019 33 11
Munich 7/11/2019 44 13
Paris 5/11/2019 27 6
Paris 6/11/2019 16 7
Paris 7/11/2019 14 8
Paris 8/11/2019 10 6
df1
ID City Delivery_Date Provider
1456223 London 7/11/2019 Amazon
1456345 London 6/11/2019 Amazon
2345623 Paris 8/11/2019 Walmart
1287456 Paris 7/11/2019 Amazon
4568971 Munich 7/11/2019 Amazon
3456789 Berlin 6/11/2019 Walmart
Output1
ID City Delivery_Date Wind Temperature
1456223 London 7/11/2019 10 5
1456345 London 6/11/2019 28 6
2345623 Paris 8/11/2019 10 6
1287456 Paris 7/11/2019 14 8
4568971 Munich 7/11/2019 44 13
Output 2
Here the weather details of the Item should displayed till its delivery date is met
ID City Delivery_Date Wind Temperature
1456223 London 5/11/2019 14 5
1456223 London 6/11/2019 28 6
1456223 London 7/11/2019 10 5
1287456 Paris 5/11/2019 27 6
1287456 Paris 6/11/2019 16 7
1287456 Paris 7/11/2019 14 8
How can this be done.

considering DF and DF1 as data frames as you explained.
import pandas as pd
output1 = pd.merge(DF1, DF,left_on = ['City','Date'] ,right_on = ['City','Delivery_Date'], how='inner' )
res1 = df1.groupby('City').max() [['Delivery_Date']]
result1 = pd.merge(df,res1, on ='City')
output2 = result1 [result1['Date'] <= result1['Delivery_Date']]

You can use df.merge
import pandas as pd
df.merge(df1[['City','Delivery_Date','ID']],left_on = ['City','Date'] ,right_on = ['City','Delivery_Date'],how='inner')

Related

How to replace the values of a column to other columns only in NaN values?

How to fill the values of column ["state"] with another column ["country"] only in NaN values?
Like in this Pandas DataFrame:
state country sum
0 NaN China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 NaN India 5
5 NaN Srilanka 6
6 NaN Malaysia 7
7 NaN Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 NaN US 12
12 NaN Canada 13
What code should I do to fill state columns with country column only in NaN values, like this:
state country sum
0 China China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 India India 5
5 Srilanka Srilanka 6
6 Malaysia Malaysia 7
7 Bhutan Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 US US 12
12 Canada Canada 13
I can use this code:
df.loc[df['state'].isnull(), 'state'] = df[df['state'].isnull()]['country'].replace(df['country'])
But in a very large dataset with 300K of rows, it compute for 5-6 minutes and crashed every time. Because it is replacing one value at a time.
Like this
Can anyone help me with efficient code for this?
Please!
Perhaps using fillna without checking for isnull() and replace():
df['state'].fillna(df['country'], inplace=True)
print(df)
Output
state country sum
0 China China 1
1 Assam India 2
2 Odisa India 3
3 Bihar India 4
4 India India 5
5 Srilanka Srilanka 6
6 Malaysia Malaysia 7
7 Bhutan Bhutan 8
8 California US 9
9 Texas US 10
10 Newyork US 11
11 US US 12
12 Canada Canada 13

Converting a dataframe

Pollutant
Delhi
London
Paris
PM25
12
36
43
Ozone
120
34
42
NO2
192
35
12
I'm trying to convert a pandas DataFrame like the one above to the one below. Any help would be greatly appreciated! Thank you in advance.
Pollutant
Level
City
PM25
12
Delhi
Ozone
120
Delhi
NO2
192
Delhi
PM25
36
London
Ozone
34
London
NO2
35
London
PM25
43
Paris
Ozone
42
Paris
NO2
12
Paris
You can use pandas.melt() and utilise the different parameters correctly which is exactly what you need:
>>> df.melt(id_vars=['Pollutant'],value_name='Level',var_name='City')
Pollutant City Level
0 PM25 Delhi 12
1 Ozone Delhi 120
2 NO2 Delhi 192
3 PM25 London 36
4 Ozone London 34
5 NO2 London 35
6 PM25 Paris 43
7 Ozone Paris 42
8 NO2 Paris 12
value_name will give a name to the column where your values will be.
Same logic applies to var_name.

Get column name as new column with the same column value

I have dataframe similar to this one:
name hobby date country 5 10 15 20 ...
Toby Guitar 2020-01-19 Brazil 0.1245 0.2543 0.7763 0.2264
Linda Cooking 2020-03-05 Italy 0.5411 0.2213 Nan 0.3342
Ben Diving 2020-04-02 USA 0.8843 0.2333 0.4486 0.2122
...
I want to the the int colmns, duplicate them, and put the int as the new value of the column, something like this:
name hobby date country 5 5 10 10 15 15 20 20...
Toby Guitar 2020-01-19 Brazil 0.1245 5 0.2543 10 0.7763 15 0.2264 20
Linda Cooking 2020-03-05 Italy 0.5411 5 0.2213 10 Nan 15 0.3342 20
Ben Diving 2020-04-02 USA 0.8843 5 0.2333 10 0.4486 15 0.2122 20
...
I'm not sure how to tackle this and looking for ideas
Here is a solution you can try out,
digits_ = pd.DataFrame(
{col: [int(col)] * len(df) for col in df.columns if col.isdigit()}
)
pd.concat([df, digits_], axis=1)
name hobby date country 5 ... 20 5 10 15 20
0 Toby Guitar 2020-01-19 Brazil 0.1245 ... 0.2264 5 10 15 20
1 Linda Cooking 2020-03-05 Italy 0.5411 ... 0.3342 5 10 15 20
2 Ben Diving 2020-04-02 USA 0.8843 ... 0.2122 5 10 15 20
I'm not sure if it is the best way to organise data with duplicated column names. I would recommend stacking (melting) it into long format.
df.melt(id_vars=["name", "hobby", "date", "country"])
Result
name hobby date country variable value
0 Toby Guitar 2020-01-19 Brazil 5 0.1245
1 Linda Cooking 2020-03-05 Italy 5 0.5411
2 Ben Diving 2020-04-02 USA 5 0.8843
3 Toby Guitar 2020-01-19 Brazil 10 0.2543
4 Linda Cooking 2020-03-05 Italy 10 0.2213
5 Ben Diving 2020-04-02 USA 10 0.2333
6 Toby Guitar 2020-01-19 Brazil 15 0.7763
7 Linda Cooking 2020-03-05 Italy 15 Nan
8 Ben Diving 2020-04-02 USA 15 0.4486
9 Toby Guitar 2020-01-19 Brazil 20 0.2264
10 Linda Cooking 2020-03-05 Italy 20 0.3342
11 Ben Diving 2020-04-02 USA 20 0.2122
You could use the pandas insert(...) function combined with a for loop
import numpy as np
import pandas as pd
df = pd.DataFrame([['Toby', 'Guitar', '2020-01-19', 'Brazil', 0.1245, 0.2543, 0.7763, 0.2264],
['Linda', 'Cooking', '2020-03-05', 'Italy', 0.5411, 0.2213, np.nan, 0.3342],
['Ben', 'Diving', '2020-04-02', 'USA', 0.8843, 0.2333, 0.4486, 0.2122]],
columns=['name', 'hobby', 'date', 'country', 5, 10, 5, 20])
start_col=4
for i in range(0, len(df.columns)-start_col):
dcol = df.columns[start_col+i*2] # digit col name to duplicate
df.insert(start_col+i*2+1, dcol, [dcol]*len(df.index), True)
results:
name hobby date country 5 ... 10 5 5 20 20
0 Toby Guitar 2020-01-19 Brazil 0.1245 ... 10 0.7763 5 0.2264 20
1 Linda Cooking 2020-03-05 Italy 0.5411 ... 10 NaN 5 0.3342 20
2 Ben Diving 2020-04-02 USA 0.8843 ... 10 0.4486 5 0.2122 20
[3 rows x 12 columns]
I assumed that all your columns are digits from the 5th, but if not you could add in the for loop an if condition to prevent this :
start_col=4
for i in range(0, len(df.columns)-start_col):
dcol = df.columns[start_col+i*2] # digit col name to duplicate
if type(dcol) is int:
df.insert(start_col+i*2+1, dcol, [dcol]*len(df.index), True)

Importing Excel data with merging cells

How we can import the excel data with merged cells ?
Please find the excel sheet image.
Last column has 3 sub columns. How we can import without making changes at excel sheet ?
You could try this
# Store data in variable
dataset = 'Merged_Column_Data.xlsx'
# Import dataset and skip row 1
df = pd.read_excel(dataset,skiprows=1)
Unnamed: 0 Unnamed: 1 Unnamed: 2 Gold Silver Bronze
0 Great Britain GBR 2012 29 17 19
1 China CHN 2012 38 28 22
2 Russia RUS 2012 24 25 32
3 United States US 2012 46 28 29
4 Korea KOR 2012 13 8 7
# Create dictionary to handle unnamed columns
col_dict = {'Unnamed: 0':'Country', 'Unnamed: 1':'Country',
'Unnamed: 2':'Year',}
# Rename columns with dictionary
df.rename(columns=col_dict)
Country Country Year Gold Silver Bronze
0 Great Britain GBR 2012 29 17 19
1 China CHN 2012 38 28 22
2 Russia RUS 2012 24 25 32
3 United States US 2012 46 28 29
4 Korea KOR 2012 13 8 7

Pandas: transform column's values in independent columns

I have Pandas DataFrame which looks like following (df_olymic).
I would like the values of column Type to be transformed in independent columns (df_olympic_table)
Original dataframe
In [3]: df_olympic
Out[3]:
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 19
Transformed dataframe
In [5]: df_olympic_table
Out[5]:
Country N_Gold N_Silver N_Bronze
0 USA 46 37 38
1 GB 27 23 17
2 China 26 18 26
3 Russia 19 18 19
What would be the most convenient way to achieve this?
You can use DataFrame.pivot:
df = df.pivot(index='Country', columns='Type', values='Num')
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
Another solution with DataFrame.set_index and Series.unstack:
df = df.set_index(['Country','Type'])['Num'].unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
but if get:
ValueError: Index contains duplicate entries, cannot reshape
need pivot_table with some aggreagte function, by default it is np.mean, but you can use sum, first...
#add new row with duplicates value in 'Country' and 'Type'
print (df)
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 20 < - changed value to 20
11 Russia Bronze 100 < - add new row with duplicates
df = df.pivot_table(index='Country', columns='Type', values='Num', aggfunc=np.mean)
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 < - Russia get ((100 + 20)/ 2 = 60
USA 38 46 37
Or groupby with aggreagting mean and reshape by unstack:
df = df.groupby(['Country','Type'])['Num'].mean().unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 < - Russia get ((100 + 20)/ 2 = 60
USA 38 46 37

Categories

Resources