I have a list of tuples of some Wikipedia data that I am scraping. I can get it into a dataframe, but it's all in one column; I need it broken out into four columns, one for each element of the tuple.
results = wikipedia.search('Kalim_Aajiz')
df = pd.DataFrame()
data = []
for i in results:
    wiki_page = wikipedia.page(i)
    data = wiki_page.title, wiki_page.url, wiki_page.summary, wiki_page.pageid
    dataList = list(data)
    print(dataList)
    df = df.append(dataList)
DATA RESULTS:
0 Kalim Aajiz
1 https://en.wikipedia.org/wiki/Kalim_Aajiz
2 Kalim Aajiz (1920 – 14 February 2015) was an I...
3 47137025
0 Robert Thurman
1 https://en.wikipedia.org/wiki/Robert_Thurman
2 Robert Alexander Farrar Thurman (born August 3...
3 475367
0 Ruskin Bond
1 https://en.wikipedia.org/wiki/Ruskin_Bond
2 Ruskin Bond (born 19 May 1934) is an Anglo Ind...
3 965456
0 Haldhar Nag
EXPECTED RESULTS:
NAME | URL | DESCRIPTION | ID
Kalim Aajiz https://en.wikipedia.org/wiki/Kalim_Aajiz was an I... 47137025
Format it into a list of dictionaries, and then make a DataFrame at the end.
results = wikipedia.search('Kalim_Aajiz')
data_list = []
for i in results:
    wiki_page = wikipedia.page(i)
    data = {'title': wiki_page.title,
            'url': wiki_page.url,
            'summary': wiki_page.summary,
            'pageid': wiki_page.pageid}
    data_list.append(data)
df = pd.DataFrame(data_list)
df
Output:
title url summary pageid
0 Kalim Aajiz https://en.wikipedia.org/wiki/Kalim_Aajiz Kalim Aajiz (1920 – 14 February 2015) was an I... 47137025
1 Robert Thurman https://en.wikipedia.org/wiki/Robert_Thurman Robert Alexander Farrar Thurman (born August 3... 475367
2 Ruskin Bond https://en.wikipedia.org/wiki/Ruskin_Bond Ruskin Bond (born 19 May 1934) is an Anglo Ind... 965456
3 Haldhar Nag https://en.wikipedia.org/wiki/Haldhar_Nag Dr. Haldhar Nag (born 31 March 1950) is a Samb... 29466145
4 Sucheta Dalal https://en.wikipedia.org/wiki/Sucheta_Dalal Sucheta Dalal (born 1962) is an Indian busines... 4125323
5 Padma Shri https://en.wikipedia.org/wiki/Padma_Shri Padma Shri (IAST: padma śrī), also spelled Pad... 442893
6 Vairamuthu https://en.wikipedia.org/wiki/Vairamuthu Vairamuthu Ramasamy (born 13 July 1953) is an ... 3604328
7 Sal Khan https://en.wikipedia.org/wiki/Sal_Khan Salman Amin Khan (born October 11, 1976), comm... 26464673
8 Arvind Gupta https://en.wikipedia.org/wiki/Arvind_Gupta Arvind Gupta is an Indian toy inventor and exp... 29176509
9 Rajdeep Sardesai https://en.wikipedia.org/wiki/Rajdeep_Sardesai Rajdeep Sardesai (born 24 May 1965)is an India... 1673653
You could just build a dictionary with your for loop and then create the data frame at the end.
For example:
results = wikipedia.search('Kalim_Aajiz')
data1 = {"NAME": [], "URL": [], "DESCRIPTION": [], "ID": []}
for i in results:
    wiki_page = wikipedia.page(i)
    data2 = wiki_page.title, wiki_page.url, wiki_page.summary, wiki_page.pageid
    for key, value in zip(data1.keys(), data2):
        data1[key].append(value)
df = pd.DataFrame(data1)
You could set a grouped index value that allows a pivot, specifically np.arange(len(df))//4, and use the current index values 0,1,2,3,0,1,2,3,... to identify the columns for the pivot.
dfp = (
    df.reset_index().assign(s=np.arange(len(df))//4).pivot(index=['s'], columns=[0])
      .droplevel(0, axis=1).rename_axis(None, axis=1).rename_axis(None, axis=0)
)
dfp.columns = ['NAME','URL','DESCRIPTION','ID']
print(dfp)
Result
NAME URL DESCRIPTION ID
0 Kalim Aajiz https://en.wikipedia.org/wiki/Kalim_Aajiz Kalim Aajiz (1920 – 14 February 2015) was an I... 47137025
1 Robert Thurman https://en.wikipedia.org/wiki/Robert_Thurman Robert Alexander Farrar Thurman (born August 3... 475367
2 Ruskin Bond https://en.wikipedia.org/wiki/Ruskin_Bond Ruskin Bond (born 19 May 1934) is an Anglo Ind... 965456
Data Frame:
Unnamed: 0 date target insult tweet year
0 1 2014-10-09 thomas-frieden fool Can you believe this fool, Dr. Thomas Frieden ... 2014
1 2 2014-10-09 thomas-frieden DOPE Can you believe this fool, Dr. Thomas Frieden ... 2014
2 3 2015-06-16 politicians all talk and no action Big time in U.S. today - MAKE AMERICA GREAT AG... 2015
3 4 2015-06-24 ben-cardin It's politicians like Cardin that have destroy... Politician #SenatorCardin didn't like that I s... 2015
4 5 2015-06-24 neil-young total hypocrite For the nonbeliever, here is a photo of #Neily... 2015
I want to filter the data frame down to only the rows whose year is 2020 or 2021, using string search and match methods.
df_filtered = df.loc[df.year.astype(str).str.contains('2014|2015', regex=True)]
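The question also asks about match; here is a sketch of the same filter using str.match, which anchors the pattern at the start of the string (the .astype(str) cast is an assumption in case the year column is numeric):
df_filtered = df.loc[df.year.astype(str).str.match(r'2014|2015')]
Swap the pattern for '2020|2021' on data that actually contains those years.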
Question: How can I sort data by bins using groupby in pandas?
What I want is the following:
release_year listed_in
1920 Documentaries
1930 TV Shows
1940 TV Shows
1950 Classic Movies, Documentaries
1960 Documentaries
1970 Classic Movies, Documentaries
1980 Classic Movies, Documentaries
1990 Classic Movies, Documentaries
2000 Classic Movies, Documentaries
2010 Children & Family Movies, Classic Movies, Comedies
2020 Classic Movies, Dramas
To achieve this result I tried the following code:
bins = [1925,1950,1960,1970,1990,2000,2010,2020]
groups = df.groupby(['listed_in', pd.cut(df.release_year, bins)])
groups.size().unstack()
It shows the following result:
release_year (1925, 1950] (1950, 1960] (1960, 1970] (1970, 1990] (1990, 2000] (2000, 2010] (2010, 2020]
listed_in
Action & Adventure 0 0 0 0 9 16 43
Action & Adventure, Anime Features, Children & Family Movies 0 0 0 0 0 0 1
Action & Adventure, Anime Features, Classic Movies 0 0 0 1 0 0 0
...
461 rows x 7 columns
I also tried the following code:
df['release_year'] = df['release_year'].astype(str).str[0:2] + '0'
df.groupby('release_year')['listed_in'].apply(lambda x: x.mode().iloc[0])
The result was the following:
release_year
190 Dramas
200 Documentaries
Name: listed_in, dtype: object
Here is a sample of the dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'show_id':['81145628','80117401','70234439'],
'type':['Movie','Movie','TV Show'],
'title':['Norm of the North: King Sized Adventure',
'Jandino: Whatever it Takes',
'Transformers Prime'],
'director':['Richard Finn, Tim Maltby',np.nan,np.nan],
'cast':['Alan Marriott, Andrew Toth, Brian Dobson',
'Jandino Asporaat','Peter Cullen, Sumalee Montano, Frank Welker'],
'country':['United States, India, South Korea, China',
'United Kingdom','United States'],
'date_added':['September 9, 2019',
'September 9, 2016',
'September 8, 2018'],
'release_year':['2019','2016','2013'],
'rating':['TV-PG','TV-MA','TV-Y7-FV'],
'duration':['90 min','94 min','1 Season'],
'listed_in':['Children & Family Movies, Comedies',
'Stand-Up Comedy','Kids TV'],
'description':['Before planning an awesome wedding for his',
'Jandino Asporaat riffs on the challenges of ra',
'With the help of three human allies, the Autob']})
The simplest way to do this is to use the first part of your code and simply make the last digit of release_year a 0. Then you can .groupby the decades and take the most popular genre for each decade, i.e. the mode:
input:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'show_id':['81145628','80117401','70234439'],
'type':['Movie','Movie','TV Show'],
'title':['Norm of the North: King Sized Adventure',
'Jandino: Whatever it Takes',
'Transformers Prime'],
'director':['Richard Finn, Tim Maltby',np.nan,np.nan],
'cast':['Alan Marriott, Andrew Toth, Brian Dobson',
'Jandino Asporaat','Peter Cullen, Sumalee Montano, Frank Welker'],
'country':['United States, India, South Korea, China',
'United Kingdom','United States'],
'date_added':['September 9, 2019',
'September 9, 2016',
'September 8, 2018'],
'release_year':['2019','2016','2013'],
'rating':['TV-PG','TV-MA','TV-Y7-FV'],
'duration':['90 min','94 min','1 Season'],
'listed_in':['Children & Family Movies, Comedies',
'Stand-Up Comedy','Kids TV'],
'description':['Before planning an awesome wedding for his',
'Jandino Asporaat riffs on the challenges of ra',
'With the help of three human allies, the Autob']})
code:
df['release_year'] = df['release_year'].astype(str).str[0:3] + '0'
df = df.groupby('release_year', as_index=False)['listed_in'].apply(lambda x: x.mode().iloc[0])
df
output:
release_year listed_in
0 2010 Children & Family Movies, Comedies
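As a variation (not part of the original answer): starting from the original release_year strings, you can also bin decades with integer division instead of string slicing; a sketch:
df['decade'] = df['release_year'].astype(int) // 10 * 10
df.groupby('decade')['listed_in'].agg(lambda x: x.mode().iloc[0])
On the sample rows above this yields the same single 2010 decade with the same mode.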
Hi, I'm new to pandas and struggling with a challenging problem.
I have 2 dataframes:
Df1
Superhero ID Superhero City
212121 Spiderman New york
364331 Ironman New york
678523 Batman Gotham
432432 Dr Strange New york
665544 Thor Asgard
123456 Superman Metropolis
555555 Nightwing Gotham
666666 Loki Asgard
And
Df2
SID Mission End date
665544 10/10/2020
665544 03/03/2021
212121 02/02/2021
665544 05/12/2020
212121 15/07/2021
123456 03/06/2021
666666 12/10/2021
I need to create a new df that summarizes how many heroes are in each city and in which quarter their missions will be complete. Also note the dates are written in the European format (day/month/year).
I am able to summarize how many heroes are in each city with the line:
df_Count = pd.DataFrame(df1.City.value_counts().reset_index())
Which gives me :
City Count
New york 3
Gotham 2
Asgard 2
Metropolis 1
I need to add another column that lists if the hero will be free from missions certain quarters.
Quarter 1 – Apr, May, Jun
Quarter 2 – Jul, Aug, Sept
Quarter 3 – Oct, Nov, Dec
Quarter 4 – Jan, Feb, Mar
If a hero ID in Df2 does not have a mission end date, the count should increase by one. If they do have an end date, it is separated into the quarters listed above.
So in the end it should look like this:
City Total Count No. of heroes free in Q3 No. of heroes free in Q4 Free in Q1 2021+
New york 3 2 0 1
Gotham 2 2 2 0
Asgard 2 1 2 0
Metropolis 1 0 0 1
I think I need to use the Python datetime library to get the current date and time, then create a custom function which I can then apply to each row using a lambda. Something similar to the code below:
from datetime import date
today = date.today()
q1 = '05/04/2021'
q3 = '05/10/2020'
q4 = '05/01/2021'
count=0
def QuarterCount(Eid,AssignmentEnd )
    if df1['Superhero ID'] == df2['SID'] :
        if df2['Mission End date']<q3:
            ++count
            return count
        elif df2['Mission End date']>q3 && <q4:
            ++count
            return count
        elif df2['Mission End date']>q1:\
            ++count
            return count

df['No. of heroes free in Q3'] = df1[].apply(lambda x(QuarterCount))
Please help me correct my syntax or logic or let me know if there is a better way to do this. Learning pandas is challenging but oddly fun. I'd appreciate any help you can provide :)
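Not a full answer, but here is a minimal sketch of one way to approach it, assuming df1 and df2 hold the data shown above with columns 'Superhero ID', 'Superhero', 'City' and 'SID', 'Mission End date', that the dates parse with dayfirst=True, and that each hero is counted once in the quarter their latest mission ends (the exact counting rule in the question may need adjusting):
import pandas as pd
import numpy as np

# Parse the European-format dates, then find each hero's latest mission
# end date (heroes with no missions in df2 end up as NaT).
df2['Mission End date'] = pd.to_datetime(df2['Mission End date'], dayfirst=True)
last_end = df2.groupby('SID')['Mission End date'].max().rename('last_end')
merged = df1.merge(last_end, left_on='Superhero ID', right_index=True, how='left')

# Label the quarter in which each hero becomes free, using the fiscal
# quarters from the question: Q3 = Oct-Dec 2020, Q4 = Jan-Mar 2021,
# anything later counts as 'Q1 2021+'. No mission at all -> free in Q3.
conditions = [
    merged['last_end'].isna() | (merged['last_end'] <= pd.Timestamp('2020-12-31')),
    merged['last_end'] <= pd.Timestamp('2021-03-31'),
]
merged['free_in'] = np.select(conditions, ['Q3', 'Q4'], default='Q1 2021+')

# City-level summary: total heroes per city plus a count per quarter label.
summary = (merged.groupby('City')['free_in']
                 .value_counts()
                 .unstack(fill_value=0))
summary.insert(0, 'Total Count', merged.groupby('City').size())
print(summary)
np.select picks the first matching label, so heroes with no row in df2 fall into the first ('Q3') bucket; adjust the cut-off dates or add cumulative counts if you need the table to match your expected output exactly.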
Suppose I have this dataframe df:
column1 column2 column3
amsterdam school yeah right backtic escapes sport swimming 2016
rotterdam nope yeah 2012
thehague i now i can fly no you cannot swimming rope 2010
amsterdam sport cycling in the winter makes me 2019
How do I get the total number of characters (excluding white-space) for each row in column2 and return it in a new column4, like this:
column1 column2 column3 column4
amsterdam school yeah right backtic escapes sport swimming 2016 70
rotterdam nope yeah 2012 8
thehague i now i can fly no you cannot swimming rope 2010 65
amsterdam sport cycling in the winter makes me 2019 55
I tried this code, but it returns the total across all rows of column2 rather than a per-row count:
df['column4'] = sum(list(map(lambda x : sum(len(y) for y in x.split()), df['column2'])))
so currently my df looks like this:
column1 column2 column3 column4
amsterdam school yeah right backtic escapes sport swimming 2016 250
rotterdam nope yeah 2012 250
thehague i now i can fly no you cannot swimming rope 2010 250
amsterdam sport cycling in the winter makes me 2019 250
Does anybody have an idea?
Use a custom lambda function with your solution:
df['column4'] = df['column2'].apply(lambda x: sum(len(y) for y in x.split()))
Or get the length of each value and subtract the number of whitespace characters counted with Series.str.count:
df['column4'] = df['column2'].str.len().sub(df['column2'].str.count(' '))
#rewritten as a custom function
#df['column4'] = df['column2'].map(lambda x: len(x) - x.count(' '))
print (df)
column1 column2 column3 \
0 amsterdam school yeah right backtic escapes sport swimming 2016
1 rotterdam nope yeah 2012
2 thehague i now i can fly no you cannot swimming rope 2010
3 amsterdam sport cycling in the winter makes me 2019
column4
0 42
1 8
2 34
3 30
Hi, this works for me:
import pandas as pd
df=pd.DataFrame({'col1':['Stack Overflow','The Guy']})
df['Count Of Chars']=df['col1'].str.replace(" ","").apply(len)
df
Output
col1 Count Of Chars
0 Stack Overflow 13
1 The Guy 6
You can use the method count with a regular expression pattern:
df['column2'].str.count(pat=r'\w')
Output:
0 42
1 8
2 34
3 30
Name: column2, dtype: int64
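One caveat to add here: '\w' counts word characters only, so punctuation would be skipped; counting non-whitespace characters with '\S' may be closer to the stated goal:
df['column4'] = df['column2'].str.count(r'\S')
On this sample the two patterns give the same numbers, since column2 contains only letters and spaces.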