Pandas inner Join does not give complete results - python

I have two CSVs.
I am trying to join the two dataframes where the StockNumber matches, but it is only returning 77 results instead of the expected 140.
Here is my code:
import pandas as pd
df=pd.read_csv('C:/Users/Starshine/Desktop/vdp.csv')
df = df.iloc[: , :-1]
df['StockNumber']=df['URL'].str[-8:]
df['StockNumber']=df['StockNumber'].str.strip('/')
df['StockNumber']=df['StockNumber'].str.strip('-')
df.to_csv('C:/Users/Starshine/Desktop/panda.csv',index=False)
dfs=pd.read_csv('C:/Users/Starshine/Desktop/a2.csv')
dfs.rename(columns={'Stock #': 'StockNumber'}, inplace=True)
dfs = dfs.iloc[: , :-2]
dfs['Stock']=df['StockNumber']
sf=pd.merge(dfs,df,on='StockNumber')
sf.to_csv('C:/Users/Starshine/Desktop/test21.csv',index=False)
print (sf)
What am I doing wrong here?

pandas.merge is case-sensitive. You have to lowercase both key columns before the merge.
Try this:
import pandas as pd
df=pd.read_csv('C:/Users/Starshine/Desktop/vdp.csv')
df['StockNumber']=df['URL'].str.rsplit('-').str[-1].str.strip('/').str.lower()
dfs=pd.read_csv('C:/Users/Starshine/Desktop/a2.csv')
dfs.rename(columns={'Stock #': 'StockNumber'}, inplace=True)
dfs['StockNumber'] = dfs['StockNumber'].str.lower()
sf=pd.merge(dfs,df,on='StockNumber')
Result (there are exactly 139 matches, not 140):
print(sf)
Vehicle StockNumber \
0 2012 Ford Fusion S a26131
1 2020 Chevrolet Malibu LS 1FL b98795
2 2010 Hyundai Santa Fe GLS 571849
3 2019 Dodge Charger GT c32026
4 2019 Toyota Camry SE 500754
.. ... ...
134 2014 Hyundai Santa Fe Sport 2.4L 656191
135 2015 Jeep Wrangler Unlimited Rubicon 206164
136 2012 Mercedes-Benz E-Class E 350 4MATIC? 545815
137 2013 Lexus GX 460 Premium c60862
138 2007 Ford F-450SD XL DRW c42901
URL Images
0 www.site.com/2007-ford-f450-super-duty-crew-ca... 0
1 www.site.com/2020-ford-f150-supercrew-cab-lari... 0
2 www.site.com/2012-mercedes-benz-e-class-e-350-... 0
3 www.site.com/2014-hyundai-santa-fe-sport-sport... 0
4 www.site.com/2013-nissan-rogue-sv-sport-utilit... 0
.. ... ...
134 www.site.com/2015-nissan-rogue-select-s-sport-... 0
135 www.site.com/2016-chevrolet-ss-sedan-4d-206164/ 0
136 www.site.com/2018-volkswagen-atlas-se-sport-ut... 41
137 www.site.com/2014-lexus-rx-rx-350-sport-utilit... 0
138 www.site.com/2017-ford-f150-supercrew-cab-xlt-... 0
[139 rows x 4 columns]
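As a self-contained illustration (hypothetical stock numbers, not the real data), here is a sketch of why the case of the key column matters to merge:
import pandas as pd

left = pd.DataFrame({'StockNumber': ['A26131', 'b98795']})
right = pd.DataFrame({'StockNumber': ['a26131', 'B98795'], 'Images': [0, 41]})

# Case-sensitive merge: no keys match, so the result is empty
print(pd.merge(left, right, on='StockNumber'))

# Lowercase both key columns first: both rows now match
left['StockNumber'] = left['StockNumber'].str.lower()
right['StockNumber'] = right['StockNumber'].str.lower()
print(pd.merge(left, right, on='StockNumber'))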

Related

convert for-loop output into dataframe python

I am trying to convert the output of this code into a dataframe, but do not know how. What is a good way to turn the output columns (string and frequency) into a dataframe?
# Extract data
import pandas as pd

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
column = "Select Investors"
all_investor = []
for i in first_df[column]:
    all_investor += str(i).lower().split(',')
# Calculate frequency
for string in all_investor:
    string = string.strip()
    frequency = first_df[column].apply(
        lambda x: string in str(x).lower()).sum()
    print(string, frequency)
Output:
andreessen horowitz 41
new enterprise associates 21
battery ventures 14
index ventures 30
dst global 19
ribbit capital 8
forerunner ventures 4
crosslink capital 4
homebrew 2
sequoia capital 115
thoma bravo 3
softbank 50
tencent holdings 28
lightspeed india partners 4
sequoia capital india 25
ggv capital 14
....
Use Series.str.split, reshape with DataFrame.stack, convert to lowercase with Series.str.lower, and finally count with Series.value_counts:
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
s = df['Select Investors'].str.split(', ', expand=True).stack().str.lower().value_counts()
print (s)
accel 54
tiger global management 48
sequoia capital china 46
andreessen horowitz 42
sequoia capital 41
..
almaz capital partners 1
commerzventures 1
sunley house capital management 1
lockheed martin ventures 1
14w 1
Length: 1187, dtype: int64
For a DataFrame, use:
df = s.rename_axis('values').reset_index(name='count')
print (df)
values count
0 accel 54
1 tiger global management 48
2 sequoia capital china 46
3 andreessen horowitz 42
4 sequoia capital 41
... ...
1182 almaz capital partners 1
1183 commerzventures 1
1184 sunley house capital management 1
1185 lockheed martin ventures 1
1186 14w 1
[1187 rows x 2 columns]
If you want to modify your solution:
from collections import Counter
url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
column = "Select Investors"
all_investor = [j.strip() for i in df[column] for j in str(i).lower().split(',')]
df1 = (pd.DataFrame(Counter(all_investor).items(), columns=['vals','count'])
         .sort_values(by='count', ascending=False, ignore_index=True))
print (df1)
vals count
0 accel 54
1 tiger global management 48
2 sequoia capital china 46
3 andreessen horowitz 42
4 sequoia capital 41
... ...
1180 futurex capital 1
1181 quiet capital 1
1182 white star capital 1
1183 almaz capital partners 1
1184 endiya partners 1
[1185 rows x 2 columns]
Use str.split and value_counts:
>>> df['Select Investors'].str.split(', ').explode().str.lower().value_counts()
accel 54
tiger global management 48
sequoia capital china 46
andreessen horowitz 42
sequoia capital 41
..
.406 ventures 1
transamerica ventures 1
crane venture partners 1
geekdom fund 1
endiya partners 1
Name: Select Investors, Length: 1187, dtype: int64
Another option keeps the original loops but deduplicates the investors and collects the rows into a list first:
# Extract data
import pandas as pd

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
column = "Select Investors"
all_investor = []
for i in first_df[column]:
    all_investor += [j.strip() for j in str(i).lower().split(',') if j.strip()]
all_investor = set(all_investor)
all_data = []
# Calculate frequency
for string in all_investor:
    frequency = first_df[column].apply(
        lambda x: string in str(x).lower()).sum()
    all_data.append([string, frequency])
new_df = pd.DataFrame(all_data, columns=["Investor", "Frequency"])
new_df = new_df.sort_values(by='Frequency', ascending=False).reset_index()
new_df
Output:
index Investor Frequency
0 854 sequoia capital 115
1 626 ing 90
2 765 accel 62
3 1025 tiger global 50
4 964 softbank 50
... ... ... ...
1180 486 quiet capital 1
1181 487 york capital management 1
1182 489 ewtp capital 1
1183 490 kleiner perkins caulfield & byers 1
1184 1184 google capital 1
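A note on why the frequencies differ between these answers (e.g. sequoia capital is 115 above but 41 with value_counts): the loop-based approaches test substring containment, so 'sequoia capital' also matches rows listing 'sequoia capital india' or 'sequoia capital china', while the split-and-count approaches count exact tokens. A minimal sketch of the difference:
import pandas as pd

s = pd.Series(['sequoia capital', 'sequoia capital india'])
# Exact tokens: 'sequoia capital' appears once
print(s.str.split(', ').explode().value_counts()['sequoia capital'])
# Substring containment: 'sequoia capital' matches both rows
print(s.apply(lambda x: 'sequoia capital' in x).sum())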

How to run hypothesis test with pandas data frame and specific conditions?

I am trying to run a hypothesis test using an OLS model. I want to model tweet count across the four groups in my data frame: Athletes, CEOs, Politicians, and Celebrities. Each name is labeled with its group in one column.
from statsmodels.formula.api import ols
import statsmodels.api as sm

frames = [CEO_df, athletes_df, Celebrity_df, politicians_df]
final_df = pd.concat(frames)
final_df=final_df.reindex(columns=["name","group","tweet_count","retweet_count","favorite_count"])
final_df
model=ols("tweet_count ~ C(group)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
I want to do something along the lines of:
model=ols("tweet_count ~ C(Athlete) + C(Celebrity) + C(CEO) + C(Politicians)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
Is that even possible? How else will I be able to run a hypothesis test with those conditions?
Here is my printed final_df:
name group tweet_count retweet_count favorite_count
0 #aws_cloud # #ReInvent R “Ray” Wang 王瑞光 #1A CEO 6 6 0
1 Aaron Levie CEO 48 1140 18624
2 Andrew Mason CEO 24 0 0
3 Bill Gates CEO 114 78204 439020
4 Bill Gross CEO 36 486 1668
... ... ... ... ... ...
56 Tim Kaine Politician 48 8346 50898
57 Tim O'Reilly Politician 14 28 0
58 Trey Gowdy Politician 12 1314 6780
59 Vice President Mike Pence Politician 84 1146408 0
60 klay thompson Politician 48 41676 309924
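As an aside, C(group) already treats the column as a single categorical factor with four levels (Athlete, CEO, Celebrity, Politician), so the first model is a one-way ANOVA across the four groups; separate C(Athlete) + C(CEO) + ... terms are not needed. A self-contained sketch on synthetic data (hypothetical counts, assuming the statsmodels formula API):
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
final_df = pd.DataFrame({
    'group': np.repeat(['Athlete', 'CEO', 'Celebrity', 'Politician'], 30),
    'tweet_count': rng.poisson(lam=50, size=120),  # hypothetical counts
})

model = ols("tweet_count ~ C(group)", data=final_df).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)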

Data not showing in table form when using Jupyter Notebook

I ran the code below in a Jupyter Notebook. I was expecting the output to appear like an Excel table, but instead the output was split up and not in a table. How can I get it to show up in table format?
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("Robbery_2014_to_2019.csv")
print(df.head())
Output:
X Y Index_ event_unique_id occurrencedate \
0 -79.270393 43.807190 17430 GO-2015134200 2015-01-23T14:52:00.000Z
1 -79.488281 43.764091 19205 GO-20142956833 2014-09-21T23:30:00.000Z
2 -79.215836 43.761856 15831 GO-2015928336 2015-03-23T11:30:00.000Z
3 -79.436264 43.642963 16727 GO-20142711563 2014-08-15T22:00:00.000Z
4 -79.369461 43.654526 20091 GO-20142492469 2014-07-12T19:00:00.000Z
reporteddate premisetype ucr_code ucr_ext \
0 2015-01-23T14:57:00.000Z Outside 1610 210
1 2014-09-21T23:37:00.000Z Outside 1610 200
2 2015-06-03T15:08:00.000Z Other 1610 220
3 2014-08-16T00:09:00.000Z Apartment 1610 200
4 2014-07-14T01:35:00.000Z Apartment 1610 100
offence ... occurrencedayofyear occurrencedayofweek \
0 Robbery - Business ... 23.0 Friday
1 Robbery - Mugging ... 264.0 Sunday
2 Robbery - Other ... 82.0 Monday
3 Robbery - Mugging ... 227.0 Friday
4 Robbery With Weapon ... 193.0 Saturday
occurrencehour MCI Division Hood_ID Neighbourhood \
0 14 Robbery D42 129 Agincourt North (129)
1 23 Robbery D31 27 York University Heights (27)
2 11 Robbery D43 137 Woburn (137)
3 22 Robbery D11 86 Roncesvalles (86)
4 19 Robbery D51 73 Moss Park (73)
Long Lat ObjectId
0 -79.270393 43.807190 2001
1 -79.488281 43.764091 2002
2 -79.215836 43.761856 2003
3 -79.436264 43.642963 2004
4 -79.369461 43.654526 2005
[5 rows x 29 columns]
Use display(df.head()) (it produces slightly nicer output than plain print()).
print() renders any kind of information, such as strings or computed values, as plain text, whereas display() renders a DataFrame as a formatted HTML table in the notebook.
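A minimal sketch of both options in a notebook cell (assuming the same CSV as in the question):
import pandas as pd
from IPython.display import display

df = pd.read_csv("Robbery_2014_to_2019.csv")

# display() works anywhere in a cell, including inside loops
display(df.head())

# The last expression in a cell is also rendered as an HTML table
df.head()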

In a pandas dataframe, count the number of times a condition occurs in one column?

Background
I have five years of NO2 measurement data, in CSV files, one file for every location and year. I have loaded all the files into pandas dataframes in the same format:
Date Hour Location NO2_Level
0 01/01/2016 00 Street 18
1 01/01/2016 01 Street 39
2 01/01/2016 02 Street 129
3 01/01/2016 03 Street 76
4 01/01/2016 04 Street 40
Goal
For each dataframe count the number of times NO2_Level is greater than 150 and output this.
So I wrote a loop that creates all the dataframes from the right directories and cleans them appropriately.
Problem
Whatever I've tried produces results I know on inspection are incorrect, e.g.:
- the count value for every location in a given year is the same (possible, but unlikely)
- for a year when I know the count should be a positive number, every location returns 0
What I've tried
I have tried a lot of approaches to getting this value for each dataframe, such as making the column a series:
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
Using DataFrame.count():
count = df[df['NO2_Level'] >= 150].count()
These two approaches have come closest to what I want to output.
Example to test on
data = {'Date': ['01/01/2016', '01/02/2016', '01/03/2016', '01/04/2016', '01/05/2016'], 'Hour': ['00', '01', '02', '03', '04'], 'Location': ['Street', 'Street', 'Street', 'Street', 'Street'], 'NO2_Level': [18, 39, 129, 76, 40]}
df = pd.DataFrame(data=data)
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
count
Expected Outputs
So from this I'm trying to get it to output a single line for each dataframe that was made in the format Location, year, count (of condition):
Kirkstall Road,2013,47
Haslewood Close,2013,97
...
Jack Lane Hunslet,2015,158
So the above example would produce (none of the five readings exceed 150):
Street, 2016, 0
Actual
Every year produces the same result for each location, and for some years (2014) the count doesn't seem to work at all when on inspection it shouldn't be zero:
Kirkstall Road,2013,47
Haslewood Close,2013,47
Tilbury Terrace,2013,47
Corn Exchange,2013,47
Temple Newsam,2014,0
Queen Street Morley,2014,0
Corn Exchange,2014,0
Tilbury Terrace,2014,0
Haslewood Close,2015,43
Tilbury Terrace,2015,43
Corn Exchange,2015,43
Jack Lane Hunslet,2015,43
Norman Rows,2015,43
Hopefully this helps.
import pandas as pd

ddict = {
    'Date': ['2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-02',],
    'Hour': ['00','01','02','03','04','02'],
    'Location': ['Street','Street','Street','Street','Street','Street',],
    'N02_Level': [19, 39, 129, 76, 40, 151],
}
df = pd.DataFrame(ddict)
# Convert dates to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Make a Year column
df['Year'] = df['Date'].apply(lambda x: x.strftime('%Y'))
# Group by location and year, count rows where N02_Level > 150
df1 = df[df['N02_Level'] > 150].groupby(['Location','Year']).size().reset_index(name='Count')
# Iterate the results
for i in range(len(df1)):
    loc = df1['Location'][i]
    yr = df1['Year'][i]
    cnt = df1['Count'][i]
    print(f'{loc},{yr},{cnt}')

### To not use f-strings
for i in range(len(df1)):
    print('{loc},{yr},{cnt}'.format(loc=df1['Location'][i], yr=df1['Year'][i], cnt=df1['Count'][i]))
Sample data:
Date Hour Location N02_Level
0 2016-01-01 00 Street 19
1 2016-01-01 01 Street 39
2 2016-01-01 02 Street 129
3 2016-01-01 03 Street 76
4 2016-01-01 04 Street 40
5 2016-01-02 02 Street 151
Output:
Street,2016,1
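One caveat (my note, not part of the original answer): because the frame is filtered with df[df['N02_Level'] > 150] before the groupby, location/year groups with zero exceedances drop out of df1 entirely instead of showing a 0. A sketch of a variant that keeps the zero-count groups by summing a boolean column instead:
# Count exceedances per group while keeping groups whose count is 0
df['Exceed'] = df['N02_Level'] > 150
df1 = df.groupby(['Location', 'Year'])['Exceed'].sum().reset_index(name='Count')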
Here is a solution with a randomly generated sample:
import numpy as np
import pandas as pd

def random_dates(start, end, n):
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

location = ['street', 'avenue', 'road', 'town', 'campaign']
df = pd.DataFrame({'Date': random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-12-31'), 20),
                   'Location': np.random.choice(location, 20),
                   'NOE_level': np.random.randint(low=130, high=200, size=20)})
# Keep only the year for Date
df['Date'] = df['Date'].dt.strftime("%Y")
print(df)
df = df.groupby(['Location', 'Date'])['NOE_level'].apply(lambda x: (x > 150).sum()).reset_index(name='count')
print(df)
Example df generated:
Date Location NOE_level
0 2018 town 191
1 2017 campaign 187
2 2017 town 137
3 2016 avenue 148
4 2017 campaign 195
5 2018 town 181
6 2018 road 187
7 2018 town 184
8 2016 town 155
9 2016 street 183
10 2018 road 136
11 2017 road 171
12 2018 street 165
13 2015 avenue 193
14 2016 campaign 170
15 2016 street 132
16 2016 campaign 165
17 2015 road 161
18 2018 road 161
19 2015 road 140
Output:
Location Date count
0 avenue 2015 1
1 avenue 2016 0
2 campaign 2016 2
3 campaign 2017 2
4 road 2015 1
5 road 2017 1
6 road 2018 2
7 street 2016 1
8 street 2018 1
9 town 2016 1
10 town 2017 0
11 town 2018 3

How to convert list to pandas DataFrame?

I use BeautifulSoup to get some data from a webpage:
import pandas as pd
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content,'html5lib')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
df.head()
But df is a list, not the pandas DataFrame as I expected from using pd.read_html.
How can I get pandas DataFrame out of it?
You can use read_html directly with your URL:
df = pd.read_html("http://www.nationmaster.com/country-info/stats/Media/Internet-users")[0]
Then, if necessary, remove the GRAPH and HISTORY columns and fill the NaN values in the # column by forward filling:
df = df.drop(['GRAPH','HISTORY'], axis=1)
df['#'] = df['#'].ffill()
print(df.head())
# COUNTRY AMOUNT DATE
0 1 China 389 million 2009
1 2 United States 245 million 2009
2 3 Japan 99.18 million 2009
3 3 Group of 7 countries (G7) average (profile) 80.32 million 2009
4 4 Brazil 75.98 million 2009
print(df.tail())
# COUNTRY AMOUNT DATE
244 214 Niue 1100 2009
245 =215 Saint Helena, Ascension, and Tristan da Cunha 900 2009
246 =215 Saint Helena 900 2009
247 217 Tokelau 800 2008
248 218 Christmas Island 464 2001
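The underlying point: pd.read_html always returns a list of DataFrames, one per table it finds, so you index into it. A quick sketch with the same URL:
import pandas as pd

tables = pd.read_html("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
print(type(tables), len(tables))  # a list of DataFrames, one per <table>
df = tables[0]                    # the relevant table here
print(df.head())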
