How to extract desired sections from a JSON string - python

I want to know how to clean up my data so that I can understand it better and sift through it more easily. So far I have been able to download a public Google Sheets document and convert it into a CSV file. But when I print the data it is quite messy and hard to understand. The data came from a website, and when I open the browser's developer tools I can see how neatly it is organized.
Like this:
[Screenshot: the data shown neatly organized in the browser's developer tools]
But when I actually print it in a Jupyter notebook, it looks messy like this:
b'/O_o/\ngoogle.visualization.Query.setResponse({"version":"0.6","reqId":"0output=csv","status":"ok","sig":"1241529276","table":{"cols":[{"id":"A","label":"Entity","type":"string"},{"id":"B","label":"Week","type":"number","pattern":"General"},{"id":"C","label":"Day","type":"date","pattern":"yyyy-mm-dd"},{"id":"D","label":"Flights
2019
(Reference)","type":"number","pattern":"General"},{"id":"E","label":"Flights","type":"number","pattern":"General"},{"id":"F","label":"%
vs 2019
(Daily)","type":"number","pattern":"General"},{"id":"G","label":"Flights
(7-day moving
average)","type":"number","pattern":"General"},{"id":"H","label":"% vs
2019 (7-day Moving
Average)","type":"number","pattern":"General"},{"id":"I","label":"Day
2019","type":"date","pattern":"yyyy-mm-dd"},{"id":"J","label":"Day
Previous
Year","type":"date","pattern":"yyyy-mm-dd"},{"id":"K","label":"Flights
Previous
Year","type":"number","pattern":"General"}],"rows":[{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,2)","f":"2020-09-02"},{"v":92.0,"f":"92"},{"v":59.0,"f":"59"},{"v":-0.358695652173913,"f":"-0,3586956522"},{"v":70.0,"f":"70"},{"v":-0.300998573466476,"f":"-0,3009985735"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":92.0,"f":"92"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,3)","f":"2020-09-03"},{"v":96.0,"f":"96"},{"v":67.0,"f":"67"},{"v":-0.302083333333333,"f":"-0,3020833333"},
Is there a pandas way to clean this data up?
Essentially what I am trying to do is extract three variables from the data: country, date, and a number.
Here you can see how the data starts out with the key "rows":
[Screenshot: Jupyter output showing where the "rows" key begins]
Essentially it gives a country, date, then a bunch of associated numbers.
What I want to get is the country name, a specific date, and a specific number.
For example, here is one such section; this sequence is repeated throughout the data:
{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},
From this section of the data I only want to extract the country name "Albania", the date "2020-09-01", and the number -0.5038.
Here is the code I used to grab the google spreadsheet data and save it as a csv:
import requests
import pandas as pd
r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=csv')
data = r.content
print(data)
Any and all advice would be amazing.
Thank you

I'm not sure how you arrived at this CSV file, but the easiest way would be to get the JSON directly with requests, load it as a dict, and process it. Nonetheless, a solution for the current response would be:
import requests
import pandas as pd
import json
r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=json')
data = r.content
data = json.loads(data.decode('utf-8').split("(", 1)[1].rsplit(")", 1)[0]) # clean up the string so only the json data is left
d = [[i['c'][0]['v'], i['c'][2]['f'], i['c'][5]['v']] for i in data['table']['rows']]  # col 0 = country, col 2 = formatted date, col 5 = "% vs 2019 (Daily)"
df = pd.DataFrame(d, columns=['country', 'date', 'number'])
Output:
| | country | date | number |
|---:|:----------|:-----------|--------------:|
| 0 | Albania | 2020-09-01 | -0.503876 |
| 1 | Albania | 2020-09-02 | -0.358696 |
| 2 | Albania | 2020-09-03 | -0.302083 |
| 3 | Albania | 2020-09-04 | -0.135922 |
| 4 | Albania | 2020-09-05 | -0.43617 |
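As an aside, a hedged alternative (not part of the original answer): the same gviz endpoint can also return plain CSV when asked via tqx=out:csv, which pandas reads directly and which skips the JSONP clean-up entirely, assuming the sheet is public:
import pandas as pd
url = ('https://docs.google.com/spreadsheets/d/'
       '1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?tqx=out:csv')
df = pd.read_csv(url)  # column labels come straight from the sheet header row
print(df.head())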

Related

PrettyTable Python table structure

I wanted to construct a table in the below format using Python.
Edit: sorry for not writing the question properly.
I have used PrettyTable:
t = PrettyTable()
t.field_names =["TestCase Name","Avg Response", "Response time "]
But I am struggling to span the columns R1 and R2.
I am trying to add data to the column TestCase Name, but TestCase Name is being added again as a column at the end.
I am trying to do this using the PrettyTable library:
t.add_column("TestCase Name", ['', 'S-1', 'S-2'])
+----------------+---------------+----------------+
| TestCase Name  | Avg Response  | Response time  |
+----------------+-------+-------+-------+--------+
|                |  R1   |  R2   |  R1   |  R2    |
+----------------+-------+-------+-------+--------+
| S-1            |       |       |       |        |
+----------------+-------+-------+-------+--------+
| S-2            |       |       |       |        |
+----------------+-------+-------+-------+--------+
Thank You
If you want to display the table in the terminal/console, see https://pypi.org/project/tabulate/ or https://pypi.org/project/prettytable/.
Although I've only ever used tabulate, so I can only recommend that one.
If you want proper data-visualisation reports with complex data structures, I'd probably go with NumPy and/or pandas.
Yeah, have a look at https://pypi.org/project/tabulate/.
If you want to use it, just run
pip install tabulate
in cmd.
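Since tabulate came up in both comments, here is a minimal sketch of approximating the desired layout with it. This is a hedged illustration: tabulate has no true column spanning, so the group labels are repeated in multiline headers and the row values are placeholders.
from tabulate import tabulate
# No real spanning in tabulate: repeat the group label above each sub-column.
rows = [
    ["S-1", "", "", "", ""],
    ["S-2", "", "", "", ""],
]
headers = [
    "TestCase Name",
    "Avg Response\nR1", "Avg Response\nR2",
    "Response time\nR1", "Response time\nR2",
]
print(tabulate(rows, headers=headers, tablefmt="grid"))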

How to populate dataframe with values drawn from a CSV in Python

I'm trying to fill an existing spreadsheet with values from a separate CSV file with Python.
I have this long CSV file with emails and matching domains that I want to insert into a spreadsheet of business contact information. Basically, insert email into the email column where the 'website' column matches up.
The spreadsheet I'm trying to populate looks like this:
| Index | business_name | email | website |
| --- | --------------- |------| ----------------- |
| 0 | Apple | | www.apple.com |
| 1 | Home Depot | | www.home-depot.com|
| 4 | Amazon | | www.amazon.com |
| 6 | Microsoft | | www.microsoft.com |
The CSV file I'm taking contacts from looks like this:
steve#apple.com, www.apple.com
jeff#amazon.com, www.amazon.com
marc#amazon.com, www.amazon.com
john#amazon.com, www.amazon.com
marc#salesforce.com, www.salesforce.com
dan#salesforce.com, www.salesforce.com
In Python:
index = [0, 1, 4, 6]
business_name = ["apple", "home depot", "amazon", "microsoft"]
email = ["" for i in range(4)]
website = ["www.apple.com", "www.home-depot.com", "www.amazon.com", "www.microsoft.com"]
col1 = ["steve#apple.com", "jeff#amazon.com", "marc#amazon.com", "john#amazon.com", "marc#salesforce.com", "Dan#salesforce.com"]
col2 = ["www.apple.com", "www.amazon.com", "www.amazon.com", "www.amazon.com", "www.salesforce.com", "www.salesforce.com"]
# spreadsheet to insert values into
spreadsheet_df = pd.DataFrame({"index":index, "business_name":business_name, "email":email, "website":website})
# csv file that is read
csv_df = pd.DataFrame({"col1":col1, "col2":col2})
Desired Output:
| Index | business_name | email | website |
| --- | --------------- |---------------------| ----------------- |
| 0 | Apple | steve#apple.com | www.apple.com |
| 1 | Home Depot | NaN | www.home-depot.com|
| 4 | Amazon | jeff#amazon.com | www.amazon.com |
| 6 | Microsoft | NaN | www.microsoft.com |
I want to iterate through every row in the CSV file to find where the 2nd column (in the CSV) matches the 4th column of the spreadsheet, then insert the corresponding value from the CSV file (the value in the 1st column) into the 3rd column of the spreadsheet.
Up until now, I've had to manually insert email contacts from the CSV file into the spreadsheet which has become very tedious. Please save me from this monotony.
I've scoured Stack Overflow for an identical or similar thread but cannot find one. I apologize if there is a thread with this same issue, or if my post is confusing or lacking information, as it is my first. There are multiple entries for a single domain, so ideally I want to append every entry in the CSV file to its matching row and column in the spreadsheet. This seems like an easy task at first but has become a massive headache for me.
Welcome to Stack Overflow! In the future, please kindly follow these guidelines. In this scenario, please follow the community pandas guidelines as well. Following these guidelines is important to how the community can help you and how you can help the community.
First you need to provide and create a minimal and reproducible example for those helping you:
# Setup
index = [0, 1, 4, 6]
business_name = ["apple", "home depot", "amazon", "microsoft"]
email = ["" for i in range(4)]
website = ["www.apple.com", "www.home-depot.com", "www.amazon.com", "www.microsoft.com"]
col1 = ["steve#apple.com", "jeff#amazon.com", "marc#amazon.com", "john#amazon.com", "marc#salesforce.com", "Dan#salesforce.com"]
col2 = ["www.apple.com", "www.amazon.com", "www.amazon.com", "www.amazon.com", "www.salesforce.com", "www.salesforce.com"]
# Create DataFrames
# In your code this is where you would read in the CSV and spreadsheet via pandas
spreadsheet_df = pd.DataFrame({"index":index, "business_name":business_name, "email":email, "website":website})
csv_df = pd.DataFrame({"col1":col1, "col2":col2})
This will also help others who are reviewing this question in the future.
If I understand you correctly, you're looking to provide an email address for every company you have on the spreadsheet.
You can accomplish this by reading the CSV and spreadsheet into dataframes and merging them:
# Merge my two dataframes
df = spreadsheet_df.merge(csv_df, left_on="website", right_on="col2", how="left")
# Only keep the columns I want
df = df[["index", "business_name", "email", "website", "col1"]]
output:
index business_name email website col1
0 0 apple www.apple.com steve#apple.com
1 1 home depot www.home-depot.com NaN
2 4 amazon www.amazon.com jeff#amazon.com
3 4 amazon www.amazon.com marc#amazon.com
4 4 amazon www.amazon.com john#amazon.com
5 6 microsoft www.microsoft.com NaN
Because you didn't provide an expected output, I don't know if this is correct.
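The question also mentions ideally appending every matching entry rather than duplicating rows. A hedged sketch of that variant, reusing the setup frames above: collapse the CSV to one comma-separated cell per website before merging.
# Aggregate all emails per domain into a single string, then left-join.
emails_per_site = (
    csv_df.groupby("col2")["col1"]
          .agg(", ".join)
          .reset_index()
          .rename(columns={"col2": "website", "col1": "email"})
)
df = spreadsheet_df.drop(columns=["email"]).merge(emails_per_site, on="website", how="left")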
If you want to associate only the first email for a business in the CSV file with a website, you can do groupby/first on that and then merge with the business dataframe. I'm also going to drop the original email column, since it serves no purpose:
import pandas
index = [0, 1, 4, 6]
business_name = ["apple", "home depot", "amazon", "microsoft"]
email = ["" for i in range(4)]
website = ["www.apple.com", "www.home-depot.com", "www.amazon.com", "www.microsoft.com"]
col1 = ["steve#apple.com", "jeff#amazon.com", "marc#amazon.com", "john#amazon.com", "marc#salesforce.com", "Dan#salesforce.com"]
col2 = ["www.apple.com", "www.amazon.com", "www.amazon.com", "www.amazon.com", "www.salesforce.com", "www.salesforce.com"]
# spreadsheet to insert values into
business = pandas.DataFrame({"index":index, "business_name":business_name, "email":email, "website":website})
# csv file that is read
email = pandas.DataFrame({"email":col1, "website":col2})
output = (
    business
    .drop(columns=['email'])  # this is empty and needs to be overwritten
    .merge(
        email.groupby('website', as_index=False).first(),  # just the first email
        on='website', how='left'  # left-join -> keep all rows from `business`
    )
    .loc[:, business.columns]  # get your original column order back
)
And I get:
index business_name email website
0 apple steve#apple.com www.apple.com
1 home depot NaN www.home-depot.com
4 amazon jeff#amazon.com www.amazon.com
6 microsoft NaN www.microsoft.com
Assuming that the spreadsheet is also a pandas dataframe and that it looks exactly like your image, there is a straightforward way of doing this using boolean indexing. I advise you to read about it further here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
First, I suggest that you turn your CSV file into a dictionary where the website is the key and the e-mail address is the value. Seeing as you don't need more than one contact, this works well. The reason I asked that question is that a dictionary cannot contain identical keys, and thus some e-mail addresses would disappear. This is easily done by reading the CSV file in as a pandas Series and doing the following:
series = pd.read_csv('path_to_csv', header=None, index_col=0).squeeze('columns')  # pd.Series.from_csv was removed in pandas 1.0
dict_contacts = series.to_dict()
Note that your order here would now be incorrect, in that you would have the e-mail as a key and the domain as a value. As such, you can do the following to swap them around:
dict_contacts = {value:key for key, value in dict_contacts.items()}
The reason for this step is that I believe it is easier to work with an expanding list of clients.
Having done that, what you could simply do is then:
for domain in dict_contacts:
    df1.loc[df1['website'] == domain, 'e-mail'] = dict_contacts[domain]  # .loc avoids chained-assignment warnings
What this does is select, for each unique key in the dictionary (i.e. the domain), only the rows whose website matches, and assign them the value of that key, i.e. the e-mail.
Finally, I have deliberately attempted to provide you with a solution that is general and thus wouldn't require additional work in case you were to have 2000 different clients with unique domains and e-mails.
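For completeness, a minimal sketch of the read-in step all three answers assume; the file names 'contacts.csv' and 'businesses.csv' are hypothetical, so adjust the paths and headers to your actual files.
import pandas as pd
# The contacts CSV has no header row: first column is the email, second the domain.
csv_df = pd.read_csv('contacts.csv', header=None, names=['col1', 'col2'])
# The spreadsheet export is assumed to carry its own header row.
spreadsheet_df = pd.read_csv('businesses.csv')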

How can I get a count of occurrences in CSV columns and save the count as a new CSV in Python

I am new to Python, and I would really appreciate the assistance; I tried the entire day. I have a CSV file containing 10 columns. I am only interested in 3: state, county and zipcode. I am trying to get a count of the occurrences in each column, for instance CA 20000, TX 14000, and to have the count results saved in a CSV file that can be easily imported into Excel and further merged with geospatial files.
I managed to select the 3 columns that I need:
import numpy as np
from tabulate import tabulate
import pandas as pd
# Replace with the path and file name on your computer
filename = "10column.csv"
# Enter the column numbers that you want to display, e.g. usecols=[3,4,5] (no spaces between the commas)
table = np.genfromtxt(filename,delimiter=',',skip_header=0,dtype='U',usecols=[4,5,6])
print(tabulate(table))
#Insert the path and name of the file
pd.DataFrame(table).to_csv("3column.csv")
Then I tried to count the occurrences, but the output is in the wrong format and I cannot save it as CSV.
import csv
from collections import Counter
import numpy as np
my_reader = csv.reader(open("3column.csv"))
# insert the desired column number in place of the 2 in rec[2]
column = [rec[2] for rec in my_reader]
np.array([Counter(column)])
print(np.array([Counter(column)]))
the result is
[Counter({'22209': 10, '20007': 5, …'})]
I cannot save it as CSV, and I would like to have it in a tabulated format:
zip, count
22209, 10
20007, 5
I would really appreciate your help
A different way to approach would be using value_counts() from Pandas documentation.
Return a Series containing counts of unique values.
Example data file 7column.csv:
id,state,city,zip,ip_address,latitude,longitude
1,NY,New York City,10005,246.78.179.157,40.6964,-74.0253
2,WA,Yakima,98907,79.61.127.155,46.6288,-120.574
3,OK,Oklahoma City,73109,129.226.225.133,35.4259,-97.5261
4,FL,Orlando,32859,104.196.5.159,28.4429,-81.4026
5,NY,New York City,10004,246.78.180.157,40.6964,-74.0253
6,FL,Orlando,32860,104.196.5.159,29.4429,-81.4026
7,IL,Chicago,60641,19.226.187.13,41.9453,-87.7474
8,NC,Fayetteville,28314,45.109.1.38,35.0583,-79.008
9,IL,Chicago,60642,19.226.187.14,41.9453,-87.7474
10,WA,Yakima,98907,79.61.127.156,46.6288,-120.574
11,IL,Chicago,60643,19.226.187.15,41.9453,-87.7474
12,CA,Sacramento,94237,77.208.31.167,38.3774,-121.4444
import pandas as pd
df = pd.read_csv("7column.csv")
zipcode = df["zip"].value_counts()
state = df["state"].value_counts()
city = df["city"].value_counts()
zipcode.to_csv('zipcode_count.csv')
state.to_csv('state_count.csv')
city.to_csv('city_count.csv')
CSV output files
state_count.csv | city_count.csv | zipcode_count.csv
,state | ,city | ,zip
IL,3 | Chicago,3 | 98907,2
NY,2 | Orlando,2 | 32859,1
FL,2 | New York City,2 | 94237,1
WA,2 | Yakima,2 | 32860,1
NC,1 | Sacramento,1 | 28314,1
OK,1 | Fayetteville,1 | 10005,1
CA,1 | Oklahoma City,1 | 10004,1
| | 60643,1
| | 60642,1
| | 60641,1
| | 73109,1
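If you want the exact zip, count header shown in the question, one small refinement of the value_counts() approach (a sketch, assuming the column names from the example file above):
import pandas as pd
df = pd.read_csv("7column.csv")
for col in ["state", "city", "zip"]:
    counts = (df[col].value_counts()       # occurrences per unique value
                     .rename_axis(col)     # name the index column
                     .reset_index(name="count"))
    counts.to_csv(f"{col}_count.csv", index=False)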
You could read the file you wrote out to CSV back in as a DataFrame and use the count method that pandas has:
states_3 = pd.DataFrame(table)
state_count = states_3.count(axis='columns')

out_name = 'statecount.xlsx'
with pd.ExcelWriter(out_name) as writer:
    state_count.to_excel(writer, sheet_name='counts')

Pandas not displaying all columns when writing to CSV

I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However the actual CSV file i am getting is like so:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
Using this code (runnable by calling createCSV(); it pulls data from the COVID government GitHub):
import csv  # csv reader
import pandas as pd  # csv parser
import requests  # retrieves the CSV from the gov data URL

def getFile():
    url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
           'csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv')
    response = requests.get(url)
    print('Writing file...')
    open('us_deaths.csv', 'wb').write(response.content)
#takes raw data from link. creates CSV for each unique state and removes unneeded headings
def createCSV():
    getFile()
    # init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    # drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    # data.drop(['Admin2'], axis=1, inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    # data.drop(['Province_State'], axis=1, inplace=True)
    data.to_csv('DEBUGDATA2.csv')
    # sets Province_State as primary key, then searches based on date and key
    # to create new CSVs in the root directory of the python app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to set the date columns (everything after the first two) to dates, so that I can select only 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know.
To ensure it works, it prints out all the field names, including Admin2 (county name), Province_State, and the rest of the dates.
However, in my CSV, as you can see, Admin2 seems to have disappeared. I am not sure how to make this work; if anyone has any ideas, that'd be great!
Changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
I needed to create a multi-key index to allow the Admin2 column to show. Any smoother tips on the date-range section are welcome; feel free to reopen.
Thanks for the help all!
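Since the question asked for a smoother way to handle the date range, here is one hedged sketch (untested against the live dataset): parse every column label once with to_datetime and keep only the labels inside the window, plus the identifying columns.
import pandas as pd
data = pd.read_csv('us_deaths.csv')
as_dates = pd.to_datetime(data.columns, errors='coerce')         # NaT for non-date labels
in_window = (as_dates >= '2020-03-23') & (as_dates <= '2020-03-29')
keep = data.columns.isin(['Province_State', 'Admin2']) | in_window
for (state, county), g in data.loc[:, keep].groupby(['Province_State', 'Admin2']):
    g.to_csv('{0}_{1}_confirmed_deaths.csv'.format(state, county), index=False)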

Apply method in Pandas cannot handle a function

I am new to pandas. The following is a subset of a dataframe named news.
Id is the id of the news item, and the text column contains the news:
Id text
1 the news is really bad.
2 I do not have any courses.
3 Asthma is very prevalent.
4 depression causes disability.
I am going to calculate the sentiment for each news item in the "text" column.
I need to create a column to hold the result of the sentiment analysis.
This is my code:
from textblob import TextBlob
review = TextBlob(news.loc[0,'text'])
print (review.sentiment.polarity)
This code works for just one news item in the text column.
I also wrote this function:
def detect_sentiment(text):
blob = TextBlob(text)
return blob.sentiment.polarity
news['sentiment'] = news.text.apply(detect_sentiment)
But it has the following error:
The `text` argument passed to `__init__(text)` must be a string, not <class 'float'>
Any solution?
I cannot reproduce your bug: your exact code works perfectly fine for me using pandas==0.24.2 and Python 3.4.3:
import pandas as pd
from textblob import TextBlob
news = pd.DataFrame(["the news is really bad.",
                     "I do not have any courses.",
                     "Asthma is very prevalent.",
                     "depression causes disability."], columns=["text"])

def detect_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity
news['sentiment'] = news.text.apply(detect_sentiment)
display(news)
Result:
+----+-------------------------------+-------------+
| | text | sentiment |
|----+-------------------------------+-------------|
| 0 | the news is really bad. | -0.7 |
| 1 | I do not have any courses. | 0 |
| 2 | Asthma is very prevalent. | 0.2 |
| 3 | depression causes disability. | 0 |
+----+-------------------------------+-------------+
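For anyone hitting the same error: the message "must be a string, not <class 'float'>" usually means the text column contains missing values, which pandas stores as float NaN. A hedged fix, assuming the same news frame:
# Replace missing text with an empty string before applying TextBlob.
news['text'] = news['text'].fillna('')
news['sentiment'] = news['text'].astype(str).apply(detect_sentiment)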
