parse text into different columns in pandas - python

I have a dataframe containing the query part of multiple urls.
For example:
in=2015-09-19&stars_4=yes&min=4&a=3&city=New+York,+NY,+United+States&out=2015-09-20&search=1\n
in=2015-09-14&stars_3=yes&min=4&a=3&city=London,+United+Kingdom&out=2015-09-15&search=1\n
in=2015-09-26&Filter=175&min=5&a=2&city=New+York,+NY,+United+States&out=2015-09-27&search=2\n
My desired dataframe should be:
in          Filter  stars    min  a  max  city    country  out         search
------------------------------------------------------------------------------
2015-09-19  NAN     stars_4  4    3  NAN  NY      US       2015-09-20  1
2015-09-14  NAN     stars_3  4    3  NAN  LONDON  UK       2015-09-15  1
2015-09-26  175     NAN      5    2  NAN  NY      US       2015-09-27  2
Is there any easy way out for this using regex?
Any help will be much appreciated! Thanks in advance!

A quick-and-dirty fix would be to just use list comprehensions:
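import pandas as pd
# Split each line on '&', then each pair on the first '=' only;
# strip() drops the trailing newline so it does not end up in the last value.
with open('data_file.txt') as f:
    json_data = [dict(b.split('=', 1) for b in line.strip().split('&')) for line in f if line.strip()]
df = pd.DataFrame.from_records(json_data)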
This won't solve your location classification issues, but will get you a better dataframe from which to work.
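As an alternative to splitting by hand, the standard library's urllib.parse.parse_qsl can do the parsing and also decodes the + signs in the city values. This is only a sketch: the queries list below stands in for however the strings are actually stored, and it still leaves the city/country split to a later step.

```python
from urllib.parse import parse_qsl

import pandas as pd

# Sample query strings copied from the question (trailing newlines omitted).
queries = [
    "in=2015-09-19&stars_4=yes&min=4&a=3&city=New+York,+NY,+United+States&out=2015-09-20&search=1",
    "in=2015-09-14&stars_3=yes&min=4&a=3&city=London,+United+Kingdom&out=2015-09-15&search=1",
    "in=2015-09-26&Filter=175&min=5&a=2&city=New+York,+NY,+United+States&out=2015-09-27&search=2",
]

# parse_qsl returns (key, value) pairs with '+' and %-escapes decoded.
records = [dict(parse_qsl(q.strip())) for q in queries]

# Keys missing from a given query become NaN in that row.
df = pd.DataFrame.from_records(records)
print(df[["in", "city", "out", "search"]])
```

All values come back as strings, so numeric columns such as min or a would still need a conversion (for example with pd.to_numeric) if you want to do arithmetic on them.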

Related

How can I use groupby to merge rows in Pandas?

I have a dataframe that looks like this:
ID  Name  Major1   Major2    Major3
12  Dave  English  NaN       NaN
12  Dave  NaN      Biology   NaN
12  Dave  NaN      NaN       History
13  Nate  Spanish  NaN       NaN
13  Nate  NaN      Business  NaN
I need to merge rows resulting in this:
ID  Name  Major1   Major2    Major3
12  Dave  English  Biology   History
13  Nate  Spanish  Business  NaN
I know this is possible with groupby but I haven't been able to get it to work correctly. Can anyone help?
If you are intent on using groupby, you could do something like this:
dataframe = dataframe.melt(['ID', 'Name']).dropna()
dataframe = dataframe.groupby(['ID', 'Name', 'variable'])['value'].sum().unstack('variable')
You may have to mess with the column names a bit, but this is what comes to me as a possible solution using groupby.
Use melt and pivot
>>> df.melt(['ID', 'Name']).dropna() \
.pivot(index=['ID', 'Name'], columns='variable', values='value') \
.reset_index().rename_axis(columns=None)
ID Name Major1 Major2 Major3
0 12 Dave English Biology History
1 13 Nate Spanish Business NaN
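Another groupby option, sketched here on the sample data from the question, is to take the first non-null value per column within each group. It assumes each (ID, Name) group has at most one non-null value per Major column; if a group had several, only the first would be kept.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "ID": [12, 12, 12, 13, 13],
    "Name": ["Dave", "Dave", "Dave", "Nate", "Nate"],
    "Major1": ["English", np.nan, np.nan, "Spanish", np.nan],
    "Major2": [np.nan, "Biology", np.nan, np.nan, "Business"],
    "Major3": [np.nan, np.nan, "History", np.nan, np.nan],
})

# GroupBy.first() returns the first non-null entry of each column per group.
merged = df.groupby(["ID", "Name"], as_index=False).first()
print(merged)
```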

How to add a word to the end of each string in a specific column (pandas dataframe)

I want to add "NSW" to the end of each town name in a pandas data frame. The dataframe currently looks like this:
0 Parkes NaN
1 Forbes NaN
2 Yanco NaN
3 Orange NaN
4 Narara NaN
5 Wyong NaN
I need every town to also have the word NSW added to it
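Try string concatenation (use ' NSW' with a leading space if you want the suffix separated from the town name):
df['Name'] = df['Name'] + 'NSW'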

Pandas read_html returned column with NaN values in Python

I am trying to parse the table located here using the pandas read_html function. I was able to parse the table, but the Capacity column came back as NaN and I am not sure why. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far:
wiki_url='Above url'
df1=pd.read_html(wiki_url,index_col=0)
Try something like this (include flavor as bs4):
df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df = df[0]
print(df.head())
Image Stadium City State \
0 NaN Aggie Memorial Stadium Las Cruces NM
1 NaN Alamodome San Antonio TX
2 NaN Alaska Airlines Field at Husky Stadium Seattle WA
3 NaN Albertsons Stadium Boise ID
4 NaN Allen E. Paulson Stadium Statesboro GA
Team Conference Capacity \
0 New Mexico State Independent 30,343[1]
1 UTSA C-USA 65000
2 Washington Pac-12 70,500[2]
3 Boise State Mountain West 36,387[3]
4 Georgia Southern Sun Belt 25000
.............................
.............................
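To remove anything in square brackets, use (regex=True is required on recent pandas versions, where str.replace treats the pattern as a literal string by default):
df.Capacity = df.Capacity.str.replace(r"\[.*\]", "", regex=True)
print(df.Capacity.head())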
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Hope this helps.
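If you then want numbers rather than strings, the same idea can be taken one step further. The following is a small sketch on a hand-made Series that mimics the Capacity values printed above, so it does not depend on fetching the Wikipedia page; the variable names are only for illustration.

```python
import pandas as pd

# Values copied from the Capacity column shown above.
capacity = pd.Series(
    ["30,343[1]", "65000", "70,500[2]", "36,387[3]", "25000"],
    name="Capacity",
)

cleaned = (
    capacity.str.replace(r"\[.*?\]", "", regex=True)  # drop footnote markers like [1]
            .str.replace(",", "", regex=False)        # drop thousands separators
            .astype(int)                              # convert the digits to integers
)
print(cleaned)
```

The non-greedy pattern \[.*?\] removes each bracketed marker separately, which matters if a cell ever carries more than one footnote.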
With your call, pandas only picks up the superscript (for whatever reason) rather than the actual value. If you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc. (where the cell has a footnote) and NaN otherwise.
You may want to look into alternative ways of fetching the data, or scrape it yourself using BeautifulSoup, since pandas is looking at, and therefore returning, the wrong data.
The answer posted by @anky_91 was correct. I wanted to try another approach without using regex; my solution is below.
df4=pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df4 = df4[0]
The solution was to take out the "r" prefix that @anky_91 used in line 1 and line 4.
print(df4.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Name: Capacity, dtype: object

Chained conditional count in Pandas

I have a dataframe that looks at how a form has been filled out. Here's an example:
ID Name Postcode Street Employer Salary
1 John NaN Craven Road NaN NaN
2 Sue TD2 NAN NaN 15000
3 Jimmy MW6 Blake Street Bank 40000
4 Laura QE2 Mill Lane NaN 20000
5 Sam NW2 Duke Avenue Farms 35000
6 Jordan SE6 NaN NaN NaN
7 NaN CB2 NaN Startup NaN
I want to return a count of successively filled out columns on the condition that all previous columns have been filled. The final output should look something like:
Name Postcode Street Employer salary
6 5 3 2 2
Is there a good Pandas way of doing this? I suppose there could be a way of applying a mask so that if any previous boolean is given as zero the current column is also zero and then counting that but I'm not sure if that is the best way.
Thanks!
I think you can use notnull and cummin:
In [99]: df.notnull().cummin(axis=1).sum(axis=0)
Out[99]:
Name 6
Postcode 5
Street 3
Employer 2
Salary 2
dtype: int64
Although note that I had to replace your NAN (Sue's street) with a float NaN before I did that, and I assumed that ID was your index.
The cumulative minimum is one way to implement "applying a mask so that if any previous boolean is given as zero the current column is also zero", as you predicted would work.
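To make that reproducible, here is a sketch that rebuilds the sample frame from the question under the same two assumptions (ID is the index, and the 'NAN' string in Sue's row is treated as missing). The intermediate variable names are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John", "Sue", "Jimmy", "Laura", "Sam", "Jordan", np.nan],
        "Postcode": [np.nan, "TD2", "MW6", "QE2", "NW2", "SE6", "CB2"],
        "Street": ["Craven Road", "NAN", "Blake Street", "Mill Lane",
                   "Duke Avenue", np.nan, np.nan],
        "Employer": [np.nan, np.nan, "Bank", np.nan, "Farms", np.nan, "Startup"],
        "Salary": [np.nan, 15000, 40000, 20000, 35000, np.nan, np.nan],
    },
    index=pd.Index(range(1, 8), name="ID"),
)

# Assumption: the literal string 'NAN' in the Street column means "missing".
df["Street"] = df["Street"].replace("NAN", np.nan)

# 1 where a cell is filled, 0 where it is missing.
filled = df.notna().astype(int)

# Row-wise cumulative minimum: once a 0 appears, every later column is 0 too.
chained = filled.cummin(axis=1)

# Count, per column, the rows that are filled up to and including that column.
counts = chained.sum(axis=0)
print(counts)
```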
Maybe use cumprod. By the way, you have the string 'NAN' in your df; I left it as is, so it counts as not null here:
df.notnull().cumprod(1).sum()
Out[59]:
ID 7
Name 6
Postcode 5
Street 4
Employer 2
Salary 2
dtype: int64

python pandas merge two or more lines of text into one line

I have a data frame with text data like the one below:
name | address | number
1 Bob bob No.56
2 #gmail.com
3 Carly carly#world.com No.90
4 Gorge greg#yahoo
5 .com
6 No.100
and want to make it like this frame.
name | address | number
1 Bob bob#gmail.com No.56
2 Carly carly#world.com No.90
3 Gorge greg#yahoo.com No.100
I am using pandas to read file but not sure how to use merge or concat.
If the name column consists of unique values:
print(df)
name address number
0 Bob bob No.56
1 NaN #gmail.com NaN
2 Carly carly#world.com No.90
3 Gorge greg#yahoo NaN
4 NaN .com NaN
5 NaN NaN No.100
df['name'] = df['name'].ffill()
print(df.fillna('').groupby(['name'], as_index=False).sum())
name address number
0 Bob bob#gmail.com No.56
1 Carly carly#world.com No.90
2 Gorge greg#yahoo.com No.100
You may need ffill(), bfill(), [::-1], .groupby('name').apply(lambda x: ' '.join(x['address'])), strip(), lstrip(), rstrip(), replace() and the like to extend the code above to more complicated data.
If you want to convert a data frame of six rows (with possible NaN entries in each column), there might be no direct pandas method for that.
You will need some code to assign the value in the name column, so that pandas knows the split rows of bob and #gmail.com belong to the same user, Bob.
You can fill each empty entry in the name column with its preceding user using the fillna or ffill methods, see pandas dataframe missing data.
df['name'] = df['name'].ffill()
# gives
name address number
0 Bob bob No.56
1 Bob #gmail.com
2 Carly carly#world.com No.90
3 Gorge greg#yahoo
4 Gorge .com
5 Gorge No.100
Then you can use the groupby and sum as the aggregation function.
df.fillna('').groupby(['name']).sum().reset_index()
# gives
name address number
0 Bob bob#gmail.com No.56
1 Carly carly#world.com No.90
2 Gorge greg#yahoo.com No.100
You may find converting between NaN and white space useful, see Replacing blank values (white space) with NaN in pandas and pandas.DataFrame.fillna.
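Putting the steps together, here is a self-contained sketch on the sample data. It joins the string pieces explicitly with ''.join instead of relying on sum() over strings, and it assumes the fragments of each record appear in the right order and that every new record starts with a non-empty name.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question, with NaN for the empty cells.
df = pd.DataFrame({
    "name": ["Bob", np.nan, "Carly", "Gorge", np.nan, np.nan],
    "address": ["bob", "#gmail.com", "carly#world.com", "greg#yahoo", ".com", np.nan],
    "number": ["No.56", np.nan, "No.90", np.nan, np.nan, "No.100"],
})

# Carry each name forward so continuation rows belong to the right person.
df["name"] = df["name"].ffill()

# Replace NaN with '' so the pieces can be concatenated, then join per group.
merged = (
    df.fillna("")
      .groupby("name", as_index=False, sort=False)
      .agg("".join)
)
print(merged)
```

Passing sort=False keeps the people in their original order rather than sorting them by name.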
