I have two csv files. The first file contains names of all countries with their capital cities,
CSV 1:
Capital Country Country Code
Budapest Hungary HUN
Rome Italy ITA
Dublin Ireland IRL
Paris France FRA
Berlin Germany DEU
.
.
.
The second CSV file contains trip details of a bus,
CSV 2:
Trip City Trip Country No. of pax
Budapest HUN 24
Paris FRA 36
Munich DEU 9
Florence ITA 5
Milan ITA 25
Rome ITA 2
Rome ITA 45
I would like to add a new column df["Tourism visit"] holding the number of pax, if the Trip City (from CSV 2) is the capital of a country (from CSV 1) and the number of pax is more than 10.
Thank you.
Try this:
df2['tourism'] = 0
mask = df2['Trip City'].isin(df1['Capital']) & (df2['No. of pax'] > 10)
df2.loc[mask, 'tourism'] = df2.loc[mask, 'No. of pax']
I get :
Trip_City Trip_Country No._of_pax tourism
0 Budapest HUN 24 24
1 Paris FRA 36 36
2 Munich DEU 9 0
3 Florence ITA 5 0
4 Milan ITA 25 0
5 Rome ITA 2 0
6 Rome ITA 45 45
(I had to add _s to get pd.read_clipboard() to work properly)
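The same assignment can also be written in one step with Series.where, reusing the mask from above; it keeps the pax count where the condition holds and falls back to 0 everywhere else:
df2['tourism'] = df2['No. of pax'].where(mask, 0)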
This might also help. Import the dfs:
import pandas as pd
import numpy as np

df1 = pd.read_csv("CSV1.csv")
df2 = pd.read_csv("CSV2.csv")
Make a dictionary out of the two pandas Series:
my_dict = dict(zip(df1["Country Code"], df1["Capital"]))
Define a function that tests your conditions (note: I used np.logical_and() to combine the conditions; a plain Python and would work just as well here, since the arguments are scalars):
def isTourism(country_code, trip_city, no_of_pax):
    if np.logical_and(my_dict[country_code] == trip_city, no_of_pax > 10):
        return "Yes"
    else:
        return "No"
Call the function with map:
df2["Tourism"] = list(map(isTourism, df2["Trip Country"], df2["Trip City"], df2["No. of pax"]))
print(df2)
Trip City Trip Country No. of pax Tourism
0 Budapest HUN 24 Yes
1 Paris FRA 36 Yes
2 Munich DEU 9 No
3 Florence ITA 5 No
4 Milan ITA 25 No
5 Rome ITA 2 No
6 Rome ITA 45 Yes
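The same check can be done without a Python-level function by mapping the dictionary over the country codes and letting np.where pick the labels; this is usually faster on large frames:
df2["Tourism"] = np.where(
    df2["Trip Country"].map(my_dict).eq(df2["Trip City"]) & (df2["No. of pax"] > 10),
    "Yes", "No")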
If you filter your second dataframe to only the values > 10, you could merge and sum as follows:
import pandas as pd
df1 = pd.DataFrame({'Capital': ['Budapest', 'Rome', 'Dublin', 'Paris',
'Berlin'],
'Country': ['Hungary', 'Italy', 'Ireland', 'France',
'Germany'],
'Country Code': ['HUN', 'ITA', 'IRL', 'FRA', 'DEU']
})
df2 = pd.DataFrame({'Trip City': ['Budapest', 'Paris', 'Munich', 'Florence',
'Milan', 'Rome', 'Rome'],
'Trip Country': ['HUN', 'FRA', 'DEU', 'ITA', 'ITA',
'ITA', 'ITA'],
'No. of pax': [24, 36, 9, 5, 25, 2, 45]
})
df2 = df2[df2['No. of pax'] > 10]
combined = df1.merge(df2,
left_on=['Capital', 'Country Code'],
right_on=['Trip City', 'Trip Country'],
how='left').groupby(['Capital', 'Country Code'],
sort=False,
as_index=False)['No. of pax'].sum()
print(combined)
This prints:
Capital Country Code No. of pax
0 Budapest HUN 24.0
1 Rome ITA 45.0
2 Dublin IRL NaN
3 Paris FRA 36.0
4 Berlin DEU NaN
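If you prefer 0 over NaN for the capitals with no qualifying trips, fill and cast the result afterwards:
combined['No. of pax'] = combined['No. of pax'].fillna(0).astype(int)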
I have a Pandas dataframe that looks like this:
import pandas as pd
df = pd.DataFrame({
    'city': ['New York', 'New York', 'New York', 'Los Angeles', 'Los Angeles', 'Houston', 'Houston', 'Houston'],
    'airport': ['LGA', 'EWR', 'JFK', 'LAX', 'BUR', 'IAH', 'HOU', 'EFD'],
    'distance': [38, 50, 32, 8, 50, 90, 78, 120]
})
df
city airport distance
0 New York LGA 38
1 New York EWR 50
2 New York JFK 32
3 Los Angeles LAX 8
4 Los Angeles BUR 50
5 Houston IAH 90
6 Houston HOU 78
7 Houston EFD 120
I would like to output a separate dataframe based on the following logic:
if the value in the distance column is 40 or less for a given city/airport pair, then keep the row
if, within a given city, there is no distance of 40 or less, then show only the shortest (lowest) distance
The desired dataframe would look like this:
city airport distance
0 New York LGA 38
2 New York JFK 32
3 Los Angeles LAX 8
6 Houston HOU 78 <-- this is returned, even though it's more than 40
How would I do this?
Thanks!
So in your case you can do it with drop_duplicates plus combine_first: sorting by distance and dropping duplicates on city keeps the minimum-distance row per city, and combine_first then unions in (by index) all the rows at 40 or less.
out = df.sort_values('distance').drop_duplicates('city').combine_first(df.loc[df['distance'] <= 40])
Out[228]:
city airport distance
0 New York LGA 38
2 New York JFK 32
3 Los Angeles LAX 8
6 Houston HOU 78
Another possible solution, which is based on the following ideas:
Create a dataframe that only contains rows where distance is less than or equal to 40.
Create another dataframe whose rows correspond to the minimum of distance per group of cities.
Concatenate the above two dataframes.
Remove the duplicates.
(pd.concat([df.loc[df.distance.le(40)],
            df.loc[df.groupby('city')['distance'].idxmin()]])
 .drop_duplicates()
)
Output:
city airport distance
0 New York LGA 38
2 New York JFK 32
3 Los Angeles LAX 8
6 Houston HOU 78
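For completeness, the same logic can be expressed as a single boolean mask, a sketch that keeps every row within 40 and, only for cities where no row qualifies, the per-city minimum:
m = df['distance'] <= 40                               # rows already close enough
no_close = ~m.groupby(df['city']).transform('any')     # cities with no such row
is_min = df['distance'].eq(df.groupby('city')['distance'].transform('min'))
out = df[m | (no_close & is_min)]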
I have the following dataframe in pandas:
3/2/20 3/3/20 Measure State City
5 6 Deaths WA King County
0 0 Deaths CA Orange
14 21 Confirmed WA King County
1 1 Confirmed CA Orange
There are several additional date columns, covering the range:
1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 2/1/20 2/2/20 2/3/20 2/4/20 2/5/20 2/6/20 2/7/20 2/8/20 2/9/20 2/10/20 2/11/20 2/12/20 2/13/20 2/14/20 2/15/20 2/16/20 2/17/20 2/18/20 2/19/20 2/20/20 2/21/20 2/22/20 2/23/20 2/24/20 2/25/20 2/26/20 2/27/20 2/28/20 2/29/20 3/1/20 3/2/20 3/3/20
How do I pivot/reshape so I end up with something like the below, but including all the date columns?
State City Measure Date Value
WA King County Deaths 3/2/20 5
WA King County Deaths 3/3/20 6
WA King County Confirmed 3/2/20 14
WA King County Confirmed 3/3/20 21
CA Orange Deaths 3/2/20 0
CA Orange Deaths 3/3/20 0
CA Orange Confirmed 3/2/20 1
CA Orange Confirmed 3/3/20 1
Per comment above, melt solved my problem:
df.melt(id_vars = ['Measure', 'State', 'City'], var_name = 'Date' , value_vars=['1/22/20', '1/23/20',
'1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20',
'1/30/20', '1/31/20', '2/1/20', '2/2/20', '2/3/20', '2/4/20', '2/5/20',
'2/6/20', '2/7/20', '2/8/20', '2/9/20', '2/10/20', '2/11/20', '2/12/20',
'2/13/20', '2/14/20', '2/15/20', '2/16/20', '2/17/20', '2/18/20',
'2/19/20', '2/20/20', '2/21/20', '2/22/20', '2/23/20', '2/24/20',
'2/25/20', '2/26/20', '2/27/20', '2/28/20', '2/29/20', '3/1/20',
'3/2/20', '3/3/20'])
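Since every column outside id_vars is a date here, value_vars can be omitted entirely; melt takes all remaining columns by default, and value_name names the value column to match the desired output:
df.melt(id_vars=['Measure', 'State', 'City'], var_name='Date', value_name='Value')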
I have a dataframe as follows:
import pandas as pd
d = {
'Name' : ['James', 'John', 'Peter', 'Thomas', 'Jacob', 'Andrew','John', 'Peter', 'Thomas', 'Jacob', 'Peter', 'Thomas'],
'Order' : [1,1,1,1,1,1,2,2,2,2,3,3],
'Place' : ['Paris', 'London', 'Rome','Paris', 'Venice', 'Rome', 'Paris', 'Paris', 'London', 'Paris', 'Milan', 'Milan']
}
df = pd.DataFrame(d)
Name Order Place
0 James 1 Paris
1 John 1 London
2 Peter 1 Rome
3 Thomas 1 Paris
4 Jacob 1 Venice
5 Andrew 1 Rome
6 John 2 Paris
7 Peter 2 Paris
8 Thomas 2 London
9 Jacob 2 Paris
10 Peter 3 Milan
11 Thomas 3 Milan
The dataframe represents people visiting various cities; the Order column defines the order of the visits.
I would like to find which city each person visited immediately before Paris.
Expected dataframe is as follows
Name Order Place
1 John 1 London
2 Peter 1 Rome
4 Jacob 1 Venice
What is the pythonic way to find it?
Using merge:
# rows where the place is Paris
s = df.loc[df.Place.eq('Paris'), ['Name', 'Order']]
# step back one visit for each of those people
m = s.assign(Order=s.Order.sub(1))
# join back to recover the place visited at that earlier order
m.merge(df, on=['Name', 'Order'])
Name Order Place
0 John 1 London
1 Peter 1 Rome
2 Jacob 1 Venice
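An alternative, assuming the rows are already sorted by Order within each Name (as they are here), is to shift Place back by one within each group and keep the rows whose next stop is Paris:
df[df.groupby('Name')['Place'].shift(-1).eq('Paris')]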
I would like to add a dictionary to a list, which contains several other dictionaries.
I have a list of top travel cities:
City Country Population Area
0 Buenos Aires Argentina 2891000 4758
1 Toronto Canada 2800000 2731571
2 Pyeongchang South Korea 2581000 3194
3 Marakesh Morocco 928850 200
4 Albuquerque New Mexico 559277 491
5 Los Cabos Mexico 287651 3750
6 Greenville USA 84554 68
7 Archipelago Sea Finland 60000 8300
8 Walla Walla Valley USA 32237 33
9 Salina Island Italy 4000 27
10 Solta Croatia 1700 59
11 Iguazu Falls Argentina 0 672
I imported the excel with pandas:
import pandas as pd
travel_df = pd.read_excel('./cities.xlsx')
print(travel_df)
cities = travel_df.to_dict('records')
print(cities)
variables = list(cities[0].keys())
I would like to add another entry to the end of the list but don't know how to do so:
beijing = {"City": "Beijing", "Country": "China", "Population": 24000000, "Area": 6490}
print(beijing)
Try appending the new row to the DataFrame you read (note that append returns a new DataFrame rather than modifying in place, so assign the result):
travel_df = travel_df.append(beijing, ignore_index=True)
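Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current versions the equivalent is pd.concat. And if the goal is simply to extend the cities list itself, a plain list append is enough:
# pandas >= 2.0 replacement for DataFrame.append
travel_df = pd.concat([travel_df, pd.DataFrame([beijing])], ignore_index=True)
# or add the record to the list of dicts directly
cities.append(beijing)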
I have the following dict, which I converted to a dataframe:
players_info = {'Afghanistan': {'Asghar Stanikzai': 809.0,
'Mohammad Nabi': 851.0,
'Mohammad Shahzad': 1713.0,
'Najibullah Zadran': 643.0,
'Samiullah Shenwari': 774.0},
'Australia': {'AJ Finch': 1082.0,
'CL White': 988.0,
'DA Warner': 1691.0,
'GJ Maxwell': 822.0,
'SR Watson': 1465.0},
'England': {'AD Hales': 1340.0,
'EJG Morgan': 1577.0,
'JC Buttler': 985.0,
'KP Pietersen': 1176.0,
'LJ Wright': 759.0}}
pd.DataFrame(players_info)
The resulting output is a DataFrame with the teams as columns and the player names as the index. But I want it reshaped into rows like the following:
Player Team Score
Mohammad Nabi Afghanistan 851.0
Mohammad Shahzad Afghanistan 1713.0
Najibullah Zadran Afghanistan 643.0
JC Buttler England 985.0
KP Pietersen England 1176.0
LJ Wright England 759.0
I tried reset_index but it is not working the way I want. How can I do that?
You need:
df = df.stack().reset_index()
df.columns=['Player', 'Team', 'Score']
Output of df.head(5):
Player Team Score
0 AD Hales England 1340.0
1 AJ Finch Australia 1082.0
2 Asghar Stanikzai Afghanistan 809.0
3 CL White Australia 988.0
4 DA Warner Australia 1691.0
Let's take a stab at this using melt. Should be pretty fast.
df.rename_axis('Player').reset_index().melt('Player').dropna()
Player variable value
2 Asghar Stanikzai Afghanistan 809.0
10 Mohammad Nabi Afghanistan 851.0
11 Mohammad Shahzad Afghanistan 1713.0
12 Najibullah Zadran Afghanistan 643.0
14 Samiullah Shenwari Afghanistan 774.0
16 AJ Finch Australia 1082.0
18 CL White Australia 988.0
19 DA Warner Australia 1691.0
21 GJ Maxwell Australia 822.0
28 SR Watson Australia 1465.0
30 AD Hales England 1340.0
35 EJG Morgan England 1577.0
37 JC Buttler England 985.0
38 KP Pietersen England 1176.0
39 LJ Wright England 759.0
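To get the exact Player/Team/Score headings from the question, var_name and value_name do the renaming inside the same melt call:
(df.rename_axis('Player')
   .reset_index()
   .melt('Player', var_name='Team', value_name='Score')
   .dropna()
   .reset_index(drop=True))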