More efficient way to group data based on a mapping file - python

I can do this in Excel, but I am looking for a way to do it in Python. Do you know of a way to do the following?
Initial
District_1 District_2 District_3
Food 69 47 65
Water 87 86 32
Shelter 63 84 27
Mapping
District_1 London
District_2 London
District_3 Boston
Desired
London Boston
Food 116 65
Water 173 32
Shelter 147 27

I'm going to assume your mapping is a dictionary
mapping = {'District_1': 'London', 'District_2': 'London', 'District_3': 'Boston'}
Then use groupby with axis=1
df.groupby(mapping, axis=1).sum()
Boston London
Food 65 116
Water 32 173
Shelter 27 147
When you pass a dictionary to groupby, its get method gets applied to the axis of choice (axis=0 by default) and the result defines the groups.
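For reference, here is a minimal runnable sketch that rebuilds the example frame from the numbers in the question and applies the same groupby:
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame(
    {'District_1': [69, 87, 63],
     'District_2': [47, 86, 84],
     'District_3': [65, 32, 27]},
    index=['Food', 'Water', 'Shelter'],
)
mapping = {'District_1': 'London', 'District_2': 'London', 'District_3': 'Boston'}

# Group the columns by the mapping and sum within each group
print(df.groupby(mapping, axis=1).sum())
# Note: axis=1 grouping is deprecated in recent pandas releases;
# df.T.groupby(mapping).sum().T is an equivalent spelling.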

Related

Finding Matching ID's based on similar string match

I have a large pandas dataframe (10 million records); a snapshot is shown below:
CID Address
100 22 park street springvale nsw2655
101 U111 28 james road, Vic 2755
102 22 park st. springvale, nsw-2655
103 29 Bino Avenue , Mac - 3990
104 Unit 111 28 James rd, Vic 2755
105 Unit 111 28 James rd, Victoria 2755
I want to self-join with the same dataframe to get a list of matching CID (Customer IDs) having the same/similar addresses in a pandas dataframe.
I have tried using fuzzywuzzy, but it's taking a long time just to find the matches.
Expected Output :
CID Address
100 [102]
101 [104,105]
102 [100]
103
104 [101,105]
105 [101,104]
What is the best way to solve this?
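For context, the fuzzywuzzy attempt described above presumably looks something like the sketch below (the scorer and threshold are guesses); comparing every row against every other row is quadratic, which is why it is slow at 10 million records:
from fuzzywuzzy import fuzz
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'CID': [100, 101, 102, 103, 104, 105],
    'Address': [
        '22 park street springvale nsw2655',
        'U111 28 james road, Vic 2755',
        '22 park st. springvale, nsw-2655',
        '29 Bino Avenue , Mac - 3990',
        'Unit 111 28 James rd, Vic 2755',
        'Unit 111 28 James rd, Victoria 2755',
    ],
})

# Pairwise comparison of every address against every other address (O(n^2))
matches = {}
for _, row in df.iterrows():
    matches[row.CID] = [
        other.CID
        for _, other in df.iterrows()
        if other.CID != row.CID
        and fuzz.token_set_ratio(row.Address, other.Address) >= 90  # threshold is a guess
    ]
print(matches)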

How to resolve, list index out of range, from scraping website?

from bs4 import BeautifulSoup
import pandas as pd

with open("COVID-19 pandemic in the United States - Wikipedia.htm", "r", encoding="utf-8") as fd:
    soup = BeautifulSoup(fd)

print(soup.prettify())
all_tables = soup.find_all("table")
print("The total number of tables are {} ".format(len(all_tables)))
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
print(type(data_table))
sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))
data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))
header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
For some reason, the last line with header1 gives me a "list index out of range" error. I am not too sure what is causing it, but I know I need this line. Here is a link to the website I am using for the data: https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States. The specific table I want is the one below the horizontal bar chart.
Traceback
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-47-67ef2aac7bf3> in <module>
28 data_tables.append(td.findAll('table'))
29
---> 30 header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
31
32 header1
IndexError: list index out of range
Use pandas.read_html
Read HTML tables into a list of DataFrame objects.
This answer side-steps the question to provide a more efficient method for extracting tables from Wikipedia and gives the OP the desired end result.
The following code will more easily get the desired table from the Wikipedia page.
.read_html will return a list of dataframes.
The table you're interested in is at index 4.
Clean the table
Select the rows and columns with valid data.
This method does return the table headers, but the column names are multi-level so we'll rename them.
Before renaming the columns, if you need the original data from the column names, use us_covid_data.columns which will return a list of tuples with all the column name values.
import pandas as pd
# get list of dataframes and select index 4
us_covid_data = pd.read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States')[4]
# select rows and columns
us_covid_data = us_covid_data.iloc[0:56, 1:6]
# rename columns
us_covid_data.columns = ['state_territory', 'cases', 'deaths', 'recovered', 'hospitalized']
# display(us_covid_data)
state_territory cases deaths recovered hospitalized
0 Alabama 45785 1033 22082 2961
1 Alaska 1184 17 560 78
2 American Samoa 0 0 – –
3 Arizona 116892 2082 – 5272
4 Arkansas 24253 292 17834 1604
5 California 296499 6711 – –
6 Colorado 34316 1704 – 5527
7 Connecticut 46976 4338 – –
8 Delaware 12293 512 6778 –
9 District of Columbia 10569 561 1465 –
10 Florida 244151 4102 – 15150
11 Georgia 111211 2965 – 11919
12 Guam 1272 6 179 –
13 Hawaii 1012 19 746 116
14 Idaho 8222 94 2886 350
15 Illinois 151767 7144 – –
16 Indiana 49560 2698 36788 7139
17 Iowa 31906 725 24242 –
18 Kansas 17618 282 – 1269
19 Kentucky 17526 623 4785 2662
20 Louisiana 66435 3296 43026 –
21 Maine 3440 110 2787 354
22 Maryland 70497 3246 – 10939
23 Massachusetts 111110 8296 88725 10985
24 Michigan 73403 6225 52841 –
25 Minnesota 38606 1511 33907 4112
26 Mississippi 31257 1114 22167 2881
27 Missouri 24985 1077 – –
28 Montana 1249 23 678 89
29 Nebraska 20053 286 14641 1224
30 Nevada 22930 537 – –
31 New Hampshire 5914 382 4684 558
32 New Jersey 174628 15479 31014 –
33 New Mexico 14549 539 6181 2161
34 New York 400299 32307 71371 –
35 North Carolina 81331 1479 55318 –
36 North Dakota 3858 89 3350 218
37 Northern Mariana Islands 31 2 19 –
38 Ohio 57956 2927 – 7292
39 Oklahoma 16362 399 12432 1676
40 Oregon 10402 218 2846 1069
41 Pennsylvania 93876 6880 – –
42 Puerto Rico 8714 157 – –
43 Rhode Island 16991 960 – 1922
44 South Carolina 47214 838 – –
45 South Dakota 7105 97 6062 689
46 Tennessee 51509 646 31020 2860
47 Texas 240111 3013 122996 9610
48 Virgin Islands 112 6 79 –
49 Utah 25563 190 14448 1565
50 Vermont 1251 56 1022 –
51 Virginia 66740 1881 – 9549
52 Washington 38517 1370 – 4463
53 West Virginia 3461 95 2518 –
54 Wisconsin 35318 805 25542 3574
55 Wyoming 1675 20 1172 253
Addressing the original issue:
data is an empty list generated from data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
With data = data_table.tbody.findAll('tr', recursive=False)[1] and then data = [v for v in data.get_text().split('\n') if v], you will get the headers.
The output of data will be ['U.S. state or territory[i]', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]', 'Ref.']
Since data_tables is generated from iterating through data, it is also empty.
header1 is generated by indexing data_tables[0], so an IndexError occurs because data_tables is empty.
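A sketch of that fix in code (it assumes soup and data_table are built exactly as in the question's snippet):
# data_table comes from the question's code:
# data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
data = data_table.tbody.findAll('tr', recursive=False)[1]
headers = [v for v in data.get_text().split('\n') if v]
print(headers)
# ['U.S. state or territory[i]', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]', 'Ref.']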

I would like to create a DataFrame from updated API data OR by iterating a dictionary using a "for" loop

I need to create a dict first, then a dataframe from this:
# timeline request (single)
timeline = trading.in_play_service.get_event_timeline(
    event_id=event_ids[0]
)
print(timeline)
for update in timeline.update_detail:
    print(
        update.update_type,
        update.elapsed_added_time,
        update.team_name,
        update.update_id,
        update.elapsed_regular_time,
        update.update_time,
    )  # view resources or debug to see all values available
Right now it only prints the update data coming from the Betfair API, and I get this result:
EventTimeline
KickOff None None 9 1 2019-12-20 08:42:44.554000
YellowCard None Western United 37 22 2019-12-20 09:04:09.506000
YellowCard None Western Sydney Wanderers 51 30 2019-12-20 09:12:00.604000
YellowCard None Western Sydney Wanderers 65 38 2019-12-20 09:20:44.413000
FirstHalfEnd 2 None 84 45 2019-12-20 09:29:46.558000
SecondHalfKickOff None None 87 46 2019-12-20 09:45:46.177000
YellowCard None Western United 105 61 2019-12-20 10:00:55.977000
Goal None Western Sydney Wanderers 136 79 2019-12-20 10:19:18.448000
Goal None Western United 147 87 2019-12-20 10:27:16.620000
SecondHalfEnd 6 None 163 90 2019-12-20 10:35:56.159000
Any help with doing this? I am quite a beginner and am trying to solve many things by myself on my journey to automate my trading. Any help is appreciated, and I'm open to collaboration too.
What you can do is create an empty DataFrame like this:
dfObj = pd.DataFrame(columns=['Update_time', 'Team_name', etc...])
Then you can fill the data into the DataFrame object:
for update in timeline.update_detail:
    dfObj = dfObj.append({'Update_time': update.update_time, 'Team_name': update.team_name, etc.})
After doing this, print the DataFrame object:
print("Dataframe Contents ", dfObj, sep='\n')
Now you should see the DataFrame with your data. Note that my code will not work in this form (it is a bit pseudo). Furthermore, you should actually define the types of your columns when initializing the DataFrame object.
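For reference, a minimal runnable sketch of the same idea that collects each update as a dict and builds the DataFrame in one go (the attribute names come from the question's print loop; the column names are just illustrative):
import pandas as pd

# Collect each update as a dict, then construct the DataFrame once
rows = []
for update in timeline.update_detail:
    rows.append({
        'update_type': update.update_type,
        'elapsed_added_time': update.elapsed_added_time,
        'team_name': update.team_name,
        'update_id': update.update_id,
        'elapsed_regular_time': update.elapsed_regular_time,
        'update_time': update.update_time,
    })
dfObj = pd.DataFrame(rows)
print(dfObj)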

Incorrect results when sorting a dataset in descending order

incorrect sorting results for descending order
I tried sorting this dataset on the basis of the Release Clause column, but it isn't working. It should have shown top players like Neymar or Ronaldo with high release clauses, but it's showing some vague results.
Dataset: https://www.kaggle.com/karangadiya/fifa19/downloads/fifa19.zip/4
df=pd.read_csv('data.csv')
df1=df[['Name','Age','Overall','Release Clause']]
df1.sort_values(by='Release Clause',ascending=False,na_position='last').head()
Expected: something like this
Name Age Overall Release Clause
0 L. Messi 31 94 €226.5M
1 Cristiano Ronaldo 33 94 €127.1M
2 Neymar Jr 26 92 €228.1M
3 De Gea 27 91 €138.6M
4 K. De Bruyne 27 91 €196.4M
Actual output:
Name Age Overall Release Clause
1526 Léo Matos 32 76 €9M
3457 J. Windass 24 72 €9M
1419 Vieirinha 32 76 €9M
2519 P. Mpoku 26 74 €9M
4779 D. Geiger 20 70 €9M
My guess is that the Release Clause is stored as strings and so the sorting is done by lexicographic order ("€226.5M" < "€9M" returns True in Python).
Try to convert the Release Clause field to numbers (see Change data type of columns in Pandas) and it should work fine.
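A sketch of that conversion, assuming the Release Clause values look like '€226.5M' or '€110K' ('M' for millions, 'K' for thousands), which is how the FIFA 19 CSV formats them:
import pandas as pd

def parse_money(value):
    # Turn strings such as '€226.5M' into plain floats; missing values stay missing
    if pd.isna(value):
        return None
    value = str(value).lstrip('€')
    if value.endswith('M'):
        return float(value[:-1]) * 1_000_000
    if value.endswith('K'):
        return float(value[:-1]) * 1_000
    return float(value)

df = pd.read_csv('data.csv')
df1 = df[['Name', 'Age', 'Overall', 'Release Clause']].copy()
df1['Release Clause'] = df1['Release Clause'].apply(parse_money)
print(df1.sort_values(by='Release Clause', ascending=False, na_position='last').head())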

How to merge rows (with strings) based on column value (int) in Pandas dataframe?

I have a dataset in the following format:
df1=
userid movieid tags timestamp
73 130682 b movie 1432523704
73 130682 comedy 1432523704
73 130682 horror 1432523704
77 1199 Trilogy of the Imagination 1163220043
77 2968 Gilliam 1163220138
77 2968 Trilogy of the Imagination 1163220039
77 4467 Trilogy of the Imagination 1163220065
77 4911 Gilliam 1163220167
77 5909 Takashi Miike 1163219591
and I want another dataframe to be in format
df2=
userid tags
73 b movie[1] comedy[1] horror[1]
77 Trilogy of the Imagination[3] Gilliam[1] Takashi Miike[1]
such that I can merge all tags together for word/s count or term frequency.
In short, I want all tags for one userid concatenated together with " " (one space), such that I can also count the number of occurrences of each word. I am unable to concatenate the strings in tags together. I can count words and their occurrences. Any help/advice would be appreciated.
First count and reformat the result of the count per group. Keep it as an intermediate result:
r = df.groupby('userid').apply(lambda g: g.tags.value_counts()).reset_index(level=-1)
r
Out[46]:
level_1 tags
userid
73 b movie 1
73 horror 1
73 comedy 1
77 Trilogy of the Imagination 3
77 Gilliam 2
77 Takashi Miike 1
This simple string manipulation will give you the result per line:
r.level_1+'['+r.tags.astype(str)+']'
Out[49]:
userid
73 b movie[1]
73 horror[1]
73 comedy[1]
77 Trilogy of the Imagination[3]
77 Gilliam[2]
77 Takashi Miike[1]
The neat part of working in Python is being able to do something like this:
(r.level_1+'['+r.tags.astype(str)+']').groupby(level=0).apply(' '.join)
Out[50]:
userid
73 b movie[1] horror[1] comedy[1]
77 Trilogy of the Imagination[3] Gilliam[2] Takas...
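For completeness, the same steps can be chained into the df2 shape the question asks for (column names match the intermediate output shown above):
# Chain the pieces above into one expression producing a userid/tags frame
counts = df.groupby('userid').apply(lambda g: g.tags.value_counts()).reset_index(level=-1)
df2 = (
    (counts.level_1 + '[' + counts.tags.astype(str) + ']')
    .groupby(level=0)
    .apply(' '.join)
    .rename('tags')
    .reset_index()
)
print(df2)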
