Apply method in Pandas can not handle a function

Apply method in Pandas can not handle a function - python

I am new to pandas. The following is a sub_set of a dataframe named news:
Id is the id of news and the text column includes the news:
Id text
1 the news is really bad.
2 I do not have any courses.
3 Asthma is very prevalent.
4 depression causes disability.
I am going to calculate sentiment for each news in the "text" column.
I need to create a column to include the result of sentiment analysis.
This is my code:
from textblob import TextBlob
review = TextBlob(news.loc[0,'text'])
print (review.sentiment.polarity)
This code works for just one of the news in the text column.
I also wrote this function:
def detect_sentiment(text):
blob = TextBlob(text)
return blob.sentiment.polarity
news['sentiment'] = news.text.apply(detect_sentiment)
But it has the following error:
The `text` argument passed to `__init__(text)` must be a string, not <class 'float'>
Any solution?

I cannot reproduce your bug: your exact code is working perfectly fine to me using pandas==0.24.2 and Python 3.4.3:
import pandas as pd
from textblob import TextBlob
news = pd.DataFrame(["the news is really bad.",
"I do not have any courses.",
"Asthma is very prevalent.",
"depression causes disability."], columns=["text"])
def detect_sentiment(text):
blob = TextBlob(text)
return blob.sentiment.polarity
news['sentiment'] = news.text.apply(detect_sentiment)
display(news)
Result:
+----+-------------------------------+-------------+
| | text | sentiment |
|----+-------------------------------+-------------|
| 0 | the news is really bad. | -0.7 |
| 1 | I do not have any courses. | 0 |
| 2 | Asthma is very prevalent. | 0.2 |
| 3 | depression causes disability. | 0 |
+----+-------------------------------+-------------+

Related

PrettyTable Python table structure

I wanted to construct a table in the Below format using python.
Edit : Sorry for not writing the question properly
I have used PrettyTable
t = PrettyTable()
t.field_names =["TestCase Name","Avg Response", "Response time "]
But for Spanning the columns R1 and R2 I am struggling.
I am trying to add data to column Testcase Name,but TestCase Name is again adding as a column at the end.
I am trying to do using the Prettytable library
t.add_column("TestCase Name", ['', 'S-1', 'S-2'])
| Test Case Name | Avg Response | Response time |
+----------------+----------------+----------------+
| | R1 | R2 | R1 | R2 |
+----------------+------+---------+-------+--------+
| S-1 | | | | |
+----------------+------+---------+-------+--------+
| S-2 | | | | |
+--------------------------------------------------+```
Thank You

If you want to display the table in the terminal/console. See https://pypi.org/project/tabulate/ or https://pypi.org/project/prettytable/
Although, I’ve only ever used tabulate so can only recommend that one.
If you want proper data visualisation reports with complex data structures. I’d probably go with using NumPy and/or Pandas.

yeah have a look at https://pypi.org/project/tabulate/
and if you wanna use it just do
pip install tabulate
in cmd

Python count hashtag per platform

My data is organized in a data frame with the following structure
| ID | Post | Platform |
| -------- | ------------------- | ----------- |
| 1 | Something #hashtag1 | Twitter |
| 2 | Something #hashtag2 | Insta |
| 3 | Something #hashtag1 | Twitter |
I have been able to extract and count the hashtag using the following (using this post):
df.Post.str.extractall(r'(\#\w+)')[0].value_counts().rename_axis('hashtags').reset_index(name='count')
I am now trying to count hashtag operation occurrence from each platform. I am trying the following:
df.groupby(['Post', 'Platform'])['Post'].str.extractall(r'(\#\w+)')[0].value_counts().rename_axis('hashtags').reset_index(name='count')
But, I am getting the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'str'

We can solve this easily using 2 steps.Assumption each post has just single hashtag
Step 1: Create a new column with Hashtag
df['hashtag']= df.Post.str.extractall(r'(\#\w+)')[0].reset_index()[0]
Step 2: Group by and get the counts
df.groupby([ 'Platform']).hashtag.count()
Generic Solutions Works for any number of hashtag
We can solve this easily using 2 steps.
# extract all hashtag
df1 = df.Post.str.extractall(r'(\#\w+)')[0].reset_index()
# Ste index as index of original tagle where hash tag came from
df1.set_index('level_0',inplace = True)
df1.rename(columns={0:'hashtag'},inplace = True)
df2 = pd.merge(df,df1,right_index = True, left_index = True)
df2.groupby([ 'Platform']).hashtag.count()

How to extract desired sections from a JSON string

I want to know how to clean up my data to better understand it so that I can know how to sift through the data more easily. So far I have been able to download a public google spreadsheets doc and then convert that into a csv file. But when I print the data it is quite messy and hard to understand. The data came from a website, so when I go to google developer mode I can see how it is neatly organized.
Like this:
Website data on inspect page mode
But actually seeing it as I print into in Jupyter notebooks it looks messy like this:
b'/O_o/\ngoogle.visualization.Query.setResponse({"version":"0.6","reqId":"0output=csv","status":"ok","sig":"1241529276","table":{"cols":[{"id":"A","label":"Entity","type":"string"},{"id":"B","label":"Week","type":"number","pattern":"General"},{"id":"C","label":"Day","type":"date","pattern":"yyyy-mm-dd"},{"id":"D","label":"Flights
2019
(Reference)","type":"number","pattern":"General"},{"id":"E","label":"Flights","type":"number","pattern":"General"},{"id":"F","label":"%
vs 2019
(Daily)","type":"number","pattern":"General"},{"id":"G","label":"Flights
(7-day moving
average)","type":"number","pattern":"General"},{"id":"H","label":"% vs
2019 (7-day Moving
Average)","type":"number","pattern":"General"},{"id":"I","label":"Day
2019","type":"date","pattern":"yyyy-mm-dd"},{"id":"J","label":"Day
Previous
Year","type":"date","pattern":"yyyy-mm-dd"},{"id":"K","label":"Flights
Previous
Year","type":"number","pattern":"General"}],"rows":[{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,2)","f":"2020-09-02"},{"v":92.0,"f":"92"},{"v":59.0,"f":"59"},{"v":-0.358695652173913,"f":"-0,3586956522"},{"v":70.0,"f":"70"},{"v":-0.300998573466476,"f":"-0,3009985735"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":92.0,"f":"92"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,3)","f":"2020-09-03"},{"v":96.0,"f":"96"},{"v":67.0,"f":"67"},{"v":-0.302083333333333,"f":"-0,3020833333"},
Is there a Panda way to keep this data up?
Essentially what I am trying to do is extract three variables from the data: country, date, and a number.
Here it can be seen how the code starts out with the title, "rows":
Code in Jupyter showing how the code starts out
Essentially it gives a country, date, then a bunch of associated numbers.
What I want to get is the country name, a specific date, and a specific number.
For example, here is an example section, this sequence is repeated throughout the data:
{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},
of this section of the data I only want to get out the word Country name: Albania, the date "2020-09-01", and the number -0.5038
Here is the code I used to grab the google spreadsheet data and save it as a csv:
import requests
import pandas as pd
r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=csv')
data = r.content
print(data)
Please any and all advice would be amazing.
Thank you

I'm not sure how you arrived at this csv file, but the easiest way would be to get the json directly with requests, load it as a dict and process it. Nonetheless a solution for the current file would be:
import requests
import pandas as pd
import json
r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=jspn')
data = r.content
data = json.loads(data.decode('utf-8').split("(", 1)[1].rsplit(")", 1)[0]) # clean up the string so only the json data is left
d = [[i['c'][0]['v'], i['c'][2]['f'], i['c'][5]['v']] for i in data['table']['rows']]
df = pd.DataFrame(d, columns=['country', 'date', 'number'])
Output:
| | country | date | number |
|---:|:----------|:-----------|--------------:|
| 0 | Albania | 2020-09-01 | -0.503876 |
| 1 | Albania | 2020-09-02 | -0.358696 |
| 2 | Albania | 2020-09-03 | -0.302083 |
| 3 | Albania | 2020-09-04 | -0.135922 |
| 4 | Albania | 2020-09-05 | -0.43617 |

Pandas not displaying all columns when writing to

I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However the actual CSV file i am getting is like so:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
Using this here code (runnable by running createCSV(), pulls data from COVID govt GitHub):
import csv#csv reader
import pandas as pd#csv parser
import collections#not needed
import requests#retrieves URL fom gov data
def getFile():
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID- 19/master/csse_covid_19_data/csse_covid_19_time_series /time_series_covid19_deaths_US.csv'
response = requests.get(url)
print('Writing file...')
open('us_deaths.csv','wb').write(response.content)
#takes raw data from link. creates CSV for each unique state and removes unneeded headings
def createCSV():
getFile()
#init data
data=pd.read_csv('us_deaths.csv', delimiter = ',')
#drop extra columns
data.drop(['UID'],axis=1,inplace=True)
data.drop(['iso2'],axis=1,inplace=True)
data.drop(['iso3'],axis=1,inplace=True)
data.drop(['code3'],axis=1,inplace=True)
data.drop(['FIPS'],axis=1,inplace=True)
#data.drop(['Admin2'],axis=1,inplace=True)
data.drop(['Country_Region'],axis=1,inplace=True)
data.drop(['Lat'],axis=1,inplace=True)
data.drop(['Long_'],axis=1,inplace=True)
data.drop(['Combined_Key'],axis=1,inplace=True)
#data.drop(['Province_State'],axis=1,inplace=True)
data.to_csv('DEBUGDATA2.csv')
#sets province_state as primary key. Searches based on date and key to create new CSVS in root directory of python app
data = data.set_index('Province_State')
data = data.iloc[:,2:].rename(columns=pd.to_datetime, errors='ignore')
for name, g in data.groupby(level='Province_State'):
g[pd.date_range('03/23/2020', '03/29/20')] \
.to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to set the date columns (everything after the first two) to a date, so that i can select only from 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know.
To ensure it works, it prints out all the field names, inluding Admin2 (county name), province_state, and the rest of the dates.
However, in my CSV as you can see, Admin2 seems to have disappeared. I am not sure how to make this work, if anyone has any ideas that'd be great!

changed
data = data.set_index('Province_State')
to
data = data.set_index((['Province_State','Admin2']))
Needed to create a multi key to allow for the Admin2 column to show. Any smoother tips on the date-range section welcome to reopen
Thanks for the help all!

How to split a csv file row to columns in python?

Sorry, I am new to python. I have a csv file that gets data from google trends and writes to that file. However the output is all written to a same column. I want the date on column A and Bitcoin on column B and Cyptocurrency on column C and so on. I am really struggling with the simple task. Can any one help please? Thanks.
Below is the sample of the csv file.
"date Bitcoin Cryptocurrency Crypto isPartial"
"2013-10-27 5 0 0 False"
"2013-11-03 5 0 0 False"
"2013-11-10 5 0 0 False"
"2013-11-17 12 0 0 False"
"2013-11-24 14 0 0 False"
"2013-12-01 13 0 0 False"
This is my code to generate the file
#login
pytrend = TrendReq(google_username,google_password)
pytrend = TrendReq()
#Payload
pytrend.build_payload(kw_list=['Bitcoin','Cryptocurrency','Crypto'])
#interest over time
interest_over_time_df = pytrend.interest_over_time()
df = pd.DataFrame(interest_over_time_df)
file_name = "/Users/username/Desktop/Bitcoin.csv"
df.to_csv(file_name, sep='\t')

here you go. You will need pandas to load into a dataframe.
import pandas as pd
dataframe= pd.read_csv('Bitcoin.csv',delimiter=r"\s+")
dataframe

First of all, take a look at the CSV documentation for python, this should give you all the info and examples you need
Then I understand you want to write your rows as CSV separated by tabs so something like this should work for you:
# First you create a csv.Writer
spamwriter = csv.writer(csvfile, delimiter='\t')
# You write a row as a list into the csv.writer
spamwriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])

I was able find a few idea from other posts. Option 1 is simply just using formatting to make it look nice while Option 2 utilizes PrettyTable to give nice and formatted answer. You can find Pretty Table documenation here
Option 1 comes this previous post. All you would have to do is play around with the numbers so that the spacing is looks good enough to make you happy and of course change the file name to match your csv file.
Option 1
You could use format to left justify your output. For example,
f = open("contactlist.csv")
csv_f = csv.reader(f)
for row in csv_f:
print('{:<15} {:<15} {:<20} {:<25}'.format(*row))
Output:
Name Phone Company Email
Elon Musk 454-6723 SpaceX emusk#spacex.com
Larry Page 853-0653 Google lpage#gmail.com
Tim Cook 133-0419 Apple tcook#apple.com
Steve Ballmer 456-7893 Developers! sballmer#bluescreen.com
You can read more about format here. The < symbol left-aligns the text, and the number specifies the width of the string. Each {} can include a positional argument before the colon : - if they are omitted, the strings will appear in the order of the arguments in the unpacked list row.
Option 2
Option 2 I was able to find this information from here, Python Pretty Table
This page give you multitude of ways for solving this problem. Inlcuding a very simple of way by using the from_csv() function that can be imported from PrettyTable by using from prettytable import from_csv. Look at the example below for better insight.
Example:
Data.csv
"City name", "Area", "Population", "Annual Rainfall"
"Adelaide", 1295, 1158259, 600.5
"Brisbane", 5905, 1857594, 1146.4
"Darwin", 112, 120900, 1714.7
"Hobart", 1357, 205556, 619.5
"Sydney", 2058, 4336374, 1214.8
"Melbourne", 1566, 3806092, 646.9
"Perth", 5386, 1554769, 869.4
Python Code:
#!/usr/bin/python3
from prettytable import from_csv
with open("data.csv", "r") as fp:
x = from_csv(fp)
print(x)
Output will look something like the following:
+-----------+------+------------+-----------------+
| City name | Area | Population | Annual Rainfall |
+-----------+------+------------+-----------------+
| Adelaide | 1295 | 1158259 | 600.5 |
| Brisbane | 5905 | 1857594 | 1146.4 |
| Darwin | 112 | 120900 | 1714.7 |
| Hobart | 1357 | 205556 | 619.5 |
| Sydney | 2058 | 4336374 | 1214.8 |
| Melbourne | 1566 | 3806092 | 646.9 |
| Perth | 5386 | 1554769 | 869.4 |
+-----------+------+------------+-----------------+
Please let me know if this was beneficial by leaving a comment or casting a vote, thank you!

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Apply method in Pandas can not handle a function - python

Related

PrettyTable Python table structure

Python count hashtag per platform

How to extract desired sections from a JSON string

Pandas not displaying all columns when writing to

How to split a csv file row to columns in python?

Categories

Resources