Convert for-loop output into a dataframe in Python

I am trying to convert the output of this code into a dataframe, but do not know how. What is a good way to turn the output columns (string and frequency) into a dataframe?
# Extract data
import pandas as pd

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
column = "Select Investors"
all_investor = []
for i in first_df[column]:
    all_investor += str(i).lower().split(',')

# Calculate frequency
for string in all_investor:
    string = string.strip()
    frequency = first_df[column].apply(
        lambda x: string in str(x).lower()).sum()
    print(string, frequency)
Output:
andreessen horowitz 41
new enterprise associates 21
battery ventures 14
index ventures 30
dst global 19
ribbit capital 8
forerunner ventures 4
crosslink capital 4
homebrew 2
sequoia capital 115
thoma bravo 3
softbank 50
tencent holdings 28
lightspeed india partners 4
sequoia capital india 25
ggv capital 14
....

Use Series.str.split, reshape with DataFrame.stack, convert to lowercase with Series.str.lower, and finally count with Series.value_counts:
import pandas as pd

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
s = df['Select Investors'].str.split(', ', expand=True).stack().str.lower().value_counts()
print(s)
accel 54
tiger global management 48
sequoia capital china 46
andreessen horowitz 42
sequoia capital 41
..
almaz capital partners 1
commerzventures 1
sunley house capital management 1
lockheed martin ventures 1
14w 1
Length: 1187, dtype: int64
For a DataFrame use:
df = s.rename_axis('values').reset_index(name='count')
print(df)
values count
0 accel 54
1 tiger global management 48
2 sequoia capital china 46
3 andreessen horowitz 42
4 sequoia capital 41
... ...
1182 almaz capital partners 1
1183 commerzventures 1
1184 sunley house capital management 1
1185 lockheed martin ventures 1
1186 14w 1
[1187 rows x 2 columns]
If you want to modify your solution (stripping whitespace after splitting on ',' merges entries that differ only by surrounding spaces, which is likely why this yields 1185 rows instead of 1187):
from collections import Counter
import pandas as pd

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)[0]
column = "Select Investors"
all_investor = [j.strip() for i in df[column] for j in str(i).lower().split(',')]
df1 = (pd.DataFrame(Counter(all_investor).items(), columns=['vals', 'count'])
         .sort_values(by='count', ascending=False, ignore_index=True))
print(df1)
vals count
0 accel 54
1 tiger global management 48
2 sequoia capital china 46
3 andreessen horowitz 42
4 sequoia capital 41
... ...
1180 futurex capital 1
1181 quiet capital 1
1182 white star capital 1
1183 almaz capital partners 1
1184 endiya partners 1
[1185 rows x 2 columns]

Use str.split and value_counts:
>>> df['Select Investors'].str.split(', ').explode().str.lower().value_counts()
accel 54
tiger global management 48
sequoia capital china 46
andreessen horowitz 42
sequoia capital 41
..
.406 ventures 1
transamerica ventures 1
crane venture partners 1
geekdom fund 1
endiya partners 1
Name: Select Investors, Length: 1187, dtype: int64

# Extract data
import pandas as pd

url = "https://www.cbinsights.com/research-unicorn-companies"
df = pd.read_html(url)
first_df = df[0]
column = "Select Investors"
all_investor = []
for i in first_df[column]:
    all_investor += [j.strip() for j in str(i).lower().split(',') if j.strip()]
all_investor = set(all_investor)
all_data = []

# Calculate frequency
for string in all_investor:
    frequency = first_df[column].apply(
        lambda x: string in str(x).lower()).sum()
    all_data.append([string, frequency])

new_df = pd.DataFrame(all_data, columns=["Investor", "Frequency"])
new_df = new_df.sort_values(by='Frequency', ascending=False).reset_index()
new_df
Output (note that this approach counts substring containment, so "sequoia capital" also matches "sequoia capital india" and "sequoia capital china", which is why its count of 115 exceeds the exact-match 41 above):
index Investor Frequency
0 854 sequoia capital 115
1 626 ing 90
2 765 accel 62
3 1025 tiger global 50
4 964 softbank 50
... ... ... ...
1180 486 quiet capital 1
1181 487 york capital management 1
1182 489 ewtp capital 1
1183 490 kleiner perkins caulfield & byers 1
1184 1184 google capital 1

Related

Pandas inner Join does not give complete results

I have two CSVs.
I am trying to join the two dataframes where the StockNumber matches, but it is only returning 77 results instead of the expected 140.
Here is my code:
import pandas as pd
df=pd.read_csv('C:/Users/Starshine/Desktop/vdp.csv')
df = df.iloc[: , :-1]
df['StockNumber']=df['URL'].str[-8:]
df['StockNumber']=df['StockNumber'].str.strip('/')
df['StockNumber']=df['StockNumber'].str.strip('-')
df.to_csv('C:/Users/Starshine/Desktop/panda.csv',index=False)
dfs=pd.read_csv('C:/Users/Starshine/Desktop/a2.csv')
dfs.rename(columns={'Stock #': 'StockNumber'}, inplace=True)
dfs = dfs.iloc[: , :-2]
dfs['Stock']=df['StockNumber']
sf=pd.merge(dfs,df,on='StockNumber')
sf.to_csv('C:/Users/Starshine/Desktop/test21.csv',index=False)
print (sf)
What am I doing wrong here?
pandas.merge is case sensitive. You have to lowercase both columns before the merge.
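For illustration, a minimal sketch with toy frames (hypothetical stock numbers) showing the effect:
import pandas as pd

left = pd.DataFrame({'StockNumber': ['A26131', 'B98795'], 'Vehicle': ['Fusion', 'Malibu']})
right = pd.DataFrame({'StockNumber': ['a26131', 'b98795'], 'Images': [0, 41]})

print(len(pd.merge(left, right, on='StockNumber')))  # 0 -> no rows match, case differs
left['StockNumber'] = left['StockNumber'].str.lower()
print(len(pd.merge(left, right, on='StockNumber')))  # 2 -> all rows match after lowercasing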
Try this:
import pandas as pd
df=pd.read_csv('C:/Users/Starshine/Desktop/vdp.csv')
df['StockNumber']=df['URL'].str.rsplit('-').str[-1].str.strip('/').str.lower()
dfs=pd.read_csv('C:/Users/Starshine/Desktop/a2.csv')
dfs.rename(columns={'Stock #': 'StockNumber'}, inplace=True)
dfs['StockNumber'] = dfs['StockNumber'].str.lower()
sf=pd.merge(dfs,df,on='StockNumber')
>>> Result (there are exactly 139 matches, not 140)
print(sf)
Vehicle StockNumber \
0 2012 Ford Fusion S a26131
1 2020 Chevrolet Malibu LS 1FL b98795
2 2010 Hyundai Santa Fe GLS 571849
3 2019 Dodge Charger GT c32026
4 2019 Toyota Camry SE 500754
.. ... ...
134 2014 Hyundai Santa Fe Sport 2.4L 656191
135 2015 Jeep Wrangler Unlimited Rubicon 206164
136 2012 Mercedes-Benz E-Class E 350 4MATIC? 545815
137 2013 Lexus GX 460 Premium c60862
138 2007 Ford F-450SD XL DRW c42901
URL Images
0 www.site.com/2007-ford-f450-super-duty-crew-ca... 0
1 www.site.com/2020-ford-f150-supercrew-cab-lari... 0
2 www.site.com/2012-mercedes-benz-e-class-e-350-... 0
3 www.site.com/2014-hyundai-santa-fe-sport-sport... 0
4 www.site.com/2013-nissan-rogue-sv-sport-utilit... 0
.. ... ...
134 www.site.com/2015-nissan-rogue-select-s-sport-... 0
135 www.site.com/2016-chevrolet-ss-sedan-4d-206164/ 0
136 www.site.com/2018-volkswagen-atlas-se-sport-ut... 41
137 www.site.com/2014-lexus-rx-rx-350-sport-utilit... 0
138 www.site.com/2017-ford-f150-supercrew-cab-xlt-... 0
[139 rows x 4 columns]

Filter or select data between two rows in pandas by multiple labels

So I have this df (table) coming from a PDF transformation, for example:
    ElementRow  ElementColumn  ElementPage         ElementText    X1    Y1    X2    Y2
 1          50              0            1  Emergency Contacts   917  8793  2191  8878
 2          51              0            1             Contact  1093  1320  1451  1388
 3          51              2            1        Relationship  2444  1320  3026  1388
 4          51              7            1          Work Phone  3329  1320  3898  1388
 5          51              9            1          Home Phone  4260  1320  4857  1388
 6          51             10            1          Cell Phone  5176  1320  5684  1388
 7          51             12            1      Priority Phone  6143  1320  6495  1388
 8          51             14            1     Contact Address  6542  1320  7300  1388
 9          51             17            1                City  7939  1320  7300  1388
10          51             18            1               State  8808  1320  8137  1388
11          51             21            1                 Zip  9134  1320  9294  1388
12          52              0            1        Silvia Smith  1093  1458  1973  1526
13          52              2            1              Mother  2444  1458  2783  1526
13          52              7            1     (123) 456-78910  5176  1458  4979  1526
14          52             10            1              Austin  7939  1458  8406  1526
15          52             15            1               Texas  8808  1458  8961  1526
16          52             20            1               76063  9134  1458  9421  1526
17          52              2            1    1234 Parkside Ct  6542  1458  9421  1526
18          53              0            1         Naomi Smith  1093  2350  1973  1526
19          53              2            1                Aunt  2444  2350  2783  1526
20          53              7            1     (123) 456-78910  5176  2350  4979  1526
21          53             10            1              Austin  7939  2350  8406  1526
22          53             15            1               Texas  8808  2350  8961  1526
23          53             20            1               76063  9134  2350  9421  1526
24          53              2            1    3456 Parkside Ct  6542  2350  9421  1526
25          54             40            1   End Employee Line  6542  2350  9421  1526
25          55              0            1  Emergency Contacts   917  8793  2350  8878
I'm trying to separate each record by rows, taking the ElementRow column as a reference, keep the headers from the first rows, and then iterate through the rows that follow. The X1 column indicates which header each value belongs under. I would like to have the data like this:
        Contact  Relationship  Work Phone       Cell Phone  Priority    ContactAddress    City  State    Zip
1  Silvia Smith        Mother              (123) 456-78910            1234 Parkside Ct  Austin  Texas  76063
2   Naomi Smith          Aunt              (123) 456-78910            3456 Parkside Ct  Austin  Texas  76063
Things I tried:
I tried to slice the rows in between, taking the first index and the last index, but it raised this error:
emergStartIndex = df.index[df['ElementText'] == 'Emergency Contacts']
emergLastIndex = df.index[df['ElementText'] == 'End Employee Line']
emerRows_between = df.iloc[emergStartIndex:emergLastIndex]
TypeError: cannot do positional indexing on RangeIndex with these indexers [Int64Index([...
Slicing does work with this NumPy trick:
emerRows_between = df.iloc[np.r_[1:54,55:107]]
emerRows_between
but when I try to use the captured indexes instead of hard-coded numbers, it shows this:
emerRows_between = df.iloc[np.r_[emergStartIndex:emergLastIndex]]
emerRows_between
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I tried iterating row by row like this, but at some point the df reaches the end and I get an index-out-of-bounds error.
emergencyContactRow1 = df[['ElementText', 'X1']].iloc[emergStartIndex + 1].reset_index(drop=True)
emergencyContactRow2 = df[['ElementText', 'X1']].iloc[emergStartIndex + 2].reset_index(drop=True)
emergencyContactRow3 = df[['ElementText', 'X1']].iloc[emergStartIndex + 3].reset_index(drop=True)
emergencyContactRow4 = df[['ElementText', 'X1']].iloc[emergStartIndex + 4].reset_index(drop=True)
emergencyContactRow5 = df[['ElementText', 'X1']].iloc[emergStartIndex + 5].reset_index(drop=True)
emergencyContactRow6 = df[['ElementText', 'X1']].iloc[emergStartIndex + 6].reset_index(drop=True)
emergencyContactRow7 = df[['ElementText', 'X1']].iloc[emergStartIndex + 7].reset_index(drop=True)
emergencyContactRow8 = df[['ElementText', 'X1']].iloc[emergStartIndex + 8].reset_index(drop=True)
emergencyContactRow9 = df[['ElementText', 'X1']].iloc[emergStartIndex + 9].reset_index(drop=True)
emergencyContactRow10 = df[['ElementText', 'X1']].iloc[emergStartIndex + 10].reset_index(drop=True)
frameEmergContact1 = [emergencyContactRow1, emergencyContactRow2, emergencyContactRow3, emergencyContactRow4, emergencyContactRow5, emergencyContactRow6, emergencyContactRow7, emergencyContactRow8, emergencyContactRow9, emergencyContactRow10]
df_emergContact1 = pd.concat(frameEmergContact1, axis=1)
df_emergContact1.columns = range(df_emergContact1.shape[1])
So how can I make this code dynamic, or how can I avoid the index-out-of-bounds errors while keeping my headers, taking as a reference only the first row after the Emergency Contacts row? I know I haven't used the X1 column yet, but I first have to resolve how to iterate through those multiple indexes.
Each stretch from an Emergency Contacts index to an End Employee Line index belongs to one person (one employee) in the whole dataframe, so the idea, after capturing all those values, is to also keep a counter variable to see how many times data is captured between those two indexes.
It's a bit ugly, but this should do it. Basically you don't need the first or last two rows, so if you get rid of those and then pivot the X1 and ElementText columns you will be pretty close. Then it's a matter of getting rid of null values and promoting the first row to header.
df = df.iloc[1:-2][['ElementText', 'X1', 'ElementRow']].pivot(columns='X1', values='ElementText')
df = pd.DataFrame([x[~pd.isnull(x)] for x in df.values.T]).T
df.columns = df.iloc[0]
df = df[1:]
1. Split the dataframe into chunks wherever "Emergency Contacts" appears in the ElementText column
2. Parse each chunk into the required format (grouping on X1 works because each value shares its header's X1 coordinate)
3. Append to the output
import numpy as np
import pandas as pd

# data is the original dataframe from the question
list_of_df = np.array_split(data, data[data["ElementText"] == "Emergency Contacts"].index)
output = pd.DataFrame()
for frame in list_of_df:
    df = frame[~frame["ElementText"].isin(["Emergency Contacts", "End Employee Line"])].dropna()
    if df.shape[0] > 0:
        temp = pd.DataFrame(df.groupby("X1")["ElementText"].apply(list).tolist()).T
        temp.columns = temp.iloc[0]
        temp = temp.drop(0)
        output = pd.concat([output, temp], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
>>> output
0 Contact Relationship Work Phone ... City State Zip
0 Silvia Smith Mother None ... Austin Texas 76063
1 Naomi Smith Aunt None ... Austin Texas 76063

How to run hypothesis test with pandas data frame and specific conditions?

I am trying to run a hypothesis test using an OLS model. I am trying to fit this OLS model for tweet count based on four groups that I have in my data frame: Athletes, CEOs, Politicians, and Celebrities. Each name is labeled with one of the four groups in a single column.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

frames = [CEO_df, athletes_df, Celebrity_df, politicians_df]
final_df = pd.concat(frames)
final_df = final_df.reindex(columns=["name", "group", "tweet_count", "retweet_count", "favorite_count"])
final_df

model = ols("tweet_count ~ C(group)", data=final_df).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)
I want to do something along the lines of:
model = ols("tweet_count ~ C(Athlete) + C(Celebrity) + C(CEO) + C(Politicians)", data=final_df).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)
Is that even possible? How else will I be able to run a hypothesis test with those conditions?
Here is my printed final_df:
name group tweet_count retweet_count favorite_count
0 #aws_cloud # #ReInvent R “Ray” Wang 王瑞光 #1A CEO 6 6 0
1 Aaron Levie CEO 48 1140 18624
2 Andrew Mason CEO 24 0 0
3 Bill Gates CEO 114 78204 439020
4 Bill Gross CEO 36 486 1668
... ... ... ... ... ...
56 Tim Kaine Politician 48 8346 50898
57 Tim O'Reilly Politician 14 28 0
58 Trey Gowdy Politician 12 1314 6780
59 Vice President Mike Pence Politician 84 1146408 0
60 klay thompson Politician 48 41676 309924
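For what it's worth, a minimal sketch (synthetic data, hypothetical counts, not the real final_df): in statsmodels' formula API, C(group) already expands the single group column into indicator variables, so the one-line formula is the four-group comparison and separate C(Athlete) + C(CEO) + ... terms are not needed:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
groups = ["Athlete", "CEO", "Celebrity", "Politician"]
# 30 synthetic observations per group, with different Poisson means so the group effect is visible
final_df = pd.DataFrame({
    "group": np.tile(groups, 30),
    "tweet_count": rng.poisson(lam=[20, 40, 60, 80], size=(30, 4)).ravel(),
})

# C(group) dummy-codes the four levels; anova_lm gives one F-test across the groups
model = ols("tweet_count ~ C(group)", data=final_df).fit()
print(sm.stats.anova_lm(model, typ=2))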

In a pandas dataframe, count the number of times a condition occurs in one column?

Background
I have five years of NO2 measurement data in CSV files, one file for every location and year. I have loaded all the files into pandas dataframes in the same format:
Date Hour Location NO2_Level
0 01/01/2016 00 Street 18
1 01/01/2016 01 Street 39
2 01/01/2016 02 Street 129
3 01/01/2016 03 Street 76
4 01/01/2016 04 Street 40
Goal
For each dataframe, count the number of times NO2_Level is greater than 150 and output this.
So I wrote a loop that creates all the dataframes from the right directories and cleans them appropriately.
Problem
Whatever I've tried produces results I know on inspection are incorrect, e.g.:
- the count value for every location in a given year is the same (possible but unlikely)
- for a year where I know there should be a positive count, every location returns 0
What I've tried
I have tried a lot of approaches to getting this value for each dataframe, such as making the column a Series:
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
Using count():
count = df[df['NO2_Level'] >= 150].count()
These two approaches have gotten closest to what I want to output.
Example to test on
data = {'Date': ['01/01/2016', '01/02/2016', '01/03/2016', '01/04/2016', '01/05/2016'], 'Hour': ['00', '01', '02', '03', '04'], 'Location': ['Street', 'Street', 'Street', 'Street', 'Street'], 'NO2_Level': [18, 39, 129, 76, 40]}
df = pd.DataFrame(data=data)
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
count
Expected Outputs
So from this I'm trying to get it to output a single line for each dataframe that was made, in the format Location, year, count (of condition):
Kirkstall Road,2013,47
Haslewood Close,2013,97
...
Jack Lane Hunslet,2015,158
So the above example would produce
Street, 2016, 1
Actual
Every year produces the same result for each location, and for some years (2014) the count doesn't seem to work at all, when on inspection there should be positive counts:
Kirkstall Road,2013,47
Haslewood Close,2013,47
Tilbury Terrace,2013,47
Corn Exchange,2013,47
Temple Newsam,2014,0
Queen Street Morley,2014,0
Corn Exchange,2014,0
Tilbury Terrace,2014,0
Haslewood Close,2015,43
Tilbury Terrace,2015,43
Corn Exchange,2015,43
Jack Lane Hunslet,2015,43
Norman Rows,2015,43
Hopefully this helps.
import pandas as pd

ddict = {
    'Date': ['2016-01-01', '2016-01-01', '2016-01-01', '2016-01-01', '2016-01-01', '2016-01-02'],
    'Hour': ['00', '01', '02', '03', '04', '02'],
    'Location': ['Street', 'Street', 'Street', 'Street', 'Street', 'Street'],
    'N02_Level': [19, 39, 129, 76, 40, 151],
}
df = pd.DataFrame(ddict)
# Convert dates to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Make a Year column
df['Year'] = df['Date'].apply(lambda x: x.strftime('%Y'))
# Group by location and year, count rows where N02_Level > 150
df1 = df[df['N02_Level'] > 150].groupby(['Location', 'Year']).size().reset_index(name='Count')
# Iterate the results
for i in range(len(df1)):
    loc = df1['Location'][i]
    yr = df1['Year'][i]
    cnt = df1['Count'][i]
    print(f'{loc},{yr},{cnt}')

### Without f-strings
for i in range(len(df1)):
    print('{loc},{yr},{cnt}'.format(loc=df1['Location'][i], yr=df1['Year'][i], cnt=df1['Count'][i]))
Sample data:
Date Hour Location N02_Level
0 2016-01-01 00 Street 19
1 2016-01-01 01 Street 39
2 2016-01-01 02 Street 129
3 2016-01-01 03 Street 76
4 2016-01-01 04 Street 40
5 2016-01-02 02 Street 151
Output:
Street,2016,1
Here is a solution with a randomly generated sample (note that grouping first and then summing the condition keeps zero-count location/year pairs, which matches expected rows like Temple Newsam,2014,0):
import numpy as np
import pandas as pd

def random_dates(start, end, n):
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

location = ['street', 'avenue', 'road', 'town', 'campaign']
df = pd.DataFrame({'Date': random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-12-31'), 20),
                   'Location': np.random.choice(location, 20),
                   'NOE_level': np.random.randint(low=130, high=200, size=20)})

# Keep only the year for Date
df['Date'] = df['Date'].dt.strftime("%Y")
print(df)

df = df.groupby(['Location', 'Date'])['NOE_level'].apply(lambda x: (x > 150).sum()).reset_index(name='count')
print(df)
Example df generated:
Date Location NOE_level
0 2018 town 191
1 2017 campaign 187
2 2017 town 137
3 2016 avenue 148
4 2017 campaign 195
5 2018 town 181
6 2018 road 187
7 2018 town 184
8 2016 town 155
9 2016 street 183
10 2018 road 136
11 2017 road 171
12 2018 street 165
13 2015 avenue 193
14 2016 campaign 170
15 2016 street 132
16 2016 campaign 165
17 2015 road 161
18 2018 road 161
19 2015 road 140
output:
Location Date count
0 avenue 2015 1
1 avenue 2016 0
2 campaign 2016 2
3 campaign 2017 2
4 road 2015 1
5 road 2017 1
6 road 2018 2
7 street 2016 1
8 street 2018 1
9 town 2016 1
10 town 2017 0
11 town 2018 3

Cannot assign a value to certain columns in Pandas

Hi, I am trying to assign certain values to columns of a dataframe.
# Count the occurrences of each title by sex
full.groupby(['Sex', 'Title']).Title.count()
Sex Title
female Dona 1
Dr 1
Lady 1
Miss 260
Mlle 2
Mme 1
Mrs 197
Ms 2
the Countess 1
male Capt 1
Col 4
Don 1
Dr 7
Jonkheer 1
Major 2
Master 61
Mr 757
Rev 8
Sir 1
Name: Title, dtype: int64
The tail of my dataframe looks as follows:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket Title
413 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 NaN A.5. 3236 Mr
414 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 NaN PC 17758 Dona
415 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 NaN SOTON/O.Q. 3101262 Mr
416 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 NaN 359309 Mr
417 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 NaN 2668 Master
My dataframe is named full and I want to change the names in Title.
Here is the code I wrote:
# Create a variable rare_title to modify the names of Title
rare_title = ['Dona', "Lady", "the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer"]
# Also reassign mlle, ms, and mme accordingly
full[full.Title == "Mlle"].Title = "Miss"
full[full.Title == "Ms"].Title = "Miss"
full[full.Title == "Mme"].Title = "Mrs"
full[full.Title.isin(rare_title)].Title = "Rare Title"
I also tried the following code in pandas:
full.loc[full['Title'] == "Mlle", ['Sex', 'Title']] = "Miss"
Still the dataframe is not changed. Any help is appreciated.
Boolean-mask indexing like full[full.Title == "Mlle"] returns a copy, so assigning to its Title column never touches the original frame (the classic chained-assignment pitfall). Use loc-based indexing to set the matching row values in place:
miss = ['Mlle', 'Ms', 'Mme']
rare_title = ['Dona', "Lady", ...]
df.loc[df.Title.isin(miss), 'Title'] = 'Miss'
df.loc[df.Title.isin(rare_title), 'Title'] = 'Rare Title'
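As a quick check afterwards (a sketch, assuming df is the question's full dataframe and miss/rare_title are the lists above):
print(df['Title'].value_counts())  # 'Miss' and 'Rare Title' now absorb the remapped labels
assert not df['Title'].isin(miss + rare_title).any()  # none of the original rare labels remain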
