Pandas not displaying all columns when writing to CSV - python

I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However, the actual CSV file I am getting looks like this:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
I am using the following code (run createCSV() to execute it; it pulls data from the COVID-19 government GitHub repository):
import csv  # csv reader
import pandas as pd  # csv parser
import collections  # not needed
import requests  # retrieves URL from gov data

def getFile():
    url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
    response = requests.get(url)
    print('Writing file...')
    open('us_deaths.csv', 'wb').write(response.content)
#takes raw data from link, creates a CSV for each unique state and removes unneeded headings
def createCSV():
    getFile()
    #init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    #drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    #data.drop(['Admin2'],axis=1,inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    #data.drop(['Province_State'],axis=1,inplace=True)
    data.to_csv('DEBUGDATA2.csv')
    #sets Province_State as the key. Searches based on date and key to create new CSVs in the root directory of the Python app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the rename and the date_range in the loop is to turn the date columns (everything after the first two) into dates, so that I can select only 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know.
To check that it works, it prints out all the field names, including Admin2 (the county name), Province_State, and the rest of the dates.
However, as you can see in my CSV, Admin2 seems to have disappeared. I am not sure how to make this work; if anyone has any ideas, that would be great!

Edit: I changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
I needed to create a multi-level index to allow the Admin2 column to show. Any smoother tips on the date-range section are still welcome; one possible approach is sketched below.
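A rough sketch of a simpler date-range selection (this assumes the columns left after Province_State and Admin2 are either date strings or non-date columns such as Population; errors='coerce' turns the non-dates into NaT and the comparison then drops them). It would replace the last part of createCSV():
data = data.set_index(['Province_State', 'Admin2'])
# parse the column names; anything that is not a date becomes NaT
date_cols = pd.to_datetime(data.columns, errors='coerce')
# boolean mask keeping only the dates in the wanted range
mask = (date_cols >= '2020-03-23') & (date_cols <= '2020-03-29')
data = data.loc[:, mask]
for name, g in data.groupby(level='Province_State'):
    g.to_csv('{0}_confirmed_deaths.csv'.format(name))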
Thanks for the help all!


How to extract desired sections from a JSON string

I want to know how to clean up my data so that I can understand and sift through it more easily. So far I have been able to download a public Google Sheets document and convert it into a CSV file. But when I print the data it is quite messy and hard to understand. The data comes from a website, and when I go into the browser's developer mode I can see how it is neatly organized.
Like this:
[Screenshot: the website's data as shown in the browser's inspect view]
But when I actually print it in a Jupyter notebook it looks messy, like this:
b'/O_o/\ngoogle.visualization.Query.setResponse({"version":"0.6","reqId":"0output=csv","status":"ok","sig":"1241529276","table":{"cols":[{"id":"A","label":"Entity","type":"string"},{"id":"B","label":"Week","type":"number","pattern":"General"},{"id":"C","label":"Day","type":"date","pattern":"yyyy-mm-dd"},{"id":"D","label":"Flights
2019
(Reference)","type":"number","pattern":"General"},{"id":"E","label":"Flights","type":"number","pattern":"General"},{"id":"F","label":"%
vs 2019
(Daily)","type":"number","pattern":"General"},{"id":"G","label":"Flights
(7-day moving
average)","type":"number","pattern":"General"},{"id":"H","label":"% vs
2019 (7-day Moving
Average)","type":"number","pattern":"General"},{"id":"I","label":"Day
2019","type":"date","pattern":"yyyy-mm-dd"},{"id":"J","label":"Day
Previous
Year","type":"date","pattern":"yyyy-mm-dd"},{"id":"K","label":"Flights
Previous
Year","type":"number","pattern":"General"}],"rows":[{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,2)","f":"2020-09-02"},{"v":92.0,"f":"92"},{"v":59.0,"f":"59"},{"v":-0.358695652173913,"f":"-0,3586956522"},{"v":70.0,"f":"70"},{"v":-0.300998573466476,"f":"-0,3009985735"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":92.0,"f":"92"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,3)","f":"2020-09-03"},{"v":96.0,"f":"96"},{"v":67.0,"f":"67"},{"v":-0.302083333333333,"f":"-0,3020833333"},
Is there a pandas way to clean this data up?
Essentially what I am trying to do is extract three variables from the data: country, date, and a number.
Here you can see the section of the data that starts with the key "rows":
[Screenshot: Jupyter output showing where the "rows" section of the data starts]
Essentially it gives a country, date, then a bunch of associated numbers.
What I want to get is the country name, a specific date, and a specific number.
For example, here is one such section; this sequence is repeated throughout the data:
{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},
From this section of the data I only want to extract the country name "Albania", the date "2020-09-01", and the number -0.5038.
Here is the code I used to grab the google spreadsheet data and save it as a csv:
import requests
import pandas as pd
r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=csv')
data = r.content
print(data)
Any and all advice would be amazing.
Thank you
I'm not sure how you arrived at this CSV file, but the easiest way would be to get the JSON directly with requests, load it as a dict, and process it. Nonetheless, a solution for the current response would be:
import requests
import pandas as pd
import json
r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=jspn')
data = r.content
data = json.loads(data.decode('utf-8').split("(", 1)[1].rsplit(")", 1)[0]) # clean up the string so only the json data is left
d = [[i['c'][0]['v'], i['c'][2]['f'], i['c'][5]['v']] for i in data['table']['rows']]
df = pd.DataFrame(d, columns=['country', 'date', 'number'])
Output:
| | country | date | number |
|---:|:----------|:-----------|--------------:|
| 0 | Albania | 2020-09-01 | -0.503876 |
| 1 | Albania | 2020-09-02 | -0.358696 |
| 2 | Albania | 2020-09-03 | -0.302083 |
| 3 | Albania | 2020-09-04 | -0.135922 |
| 4 | Albania | 2020-09-05 | -0.43617 |
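As an aside (untested here, and assuming the sheet is shared publicly), the same gviz endpoint can usually be asked for plain CSV with tqx=out:csv, which would let pandas read it directly without any JSON clean-up:
import pandas as pd

# tqx=out:csv asks the Google Sheets gviz endpoint for CSV instead of the JSON wrapper
url = ('https://docs.google.com/spreadsheets/d/'
       '1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?tqx=out:csv')
df = pd.read_csv(url)
print(df.head())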

Pyspark : How to escape backslash ( \ ) in input file

I am loading a csv file into postgresql using pyspark. I have a record in the input file which looks like below -
Id,dept,city,name,country,state
1234,ABC,dallas,markhenry\,USA,texas
When I load it into the PostgreSQL database, it gets loaded like this, which is not correct:
Id | dept| city | name | country | state
1234 | ABC | dallas | markhenry,USA | texas | null
The correct output in the Postgres DB should be:
Id | dept| city | name | country | state
1234 | ABC | dallas | markhenry | USA | texas
I am reading the file like below -
input_df = spark.read.format("csv").option("quote", "\"").option("escape", "\"") \
    .option("header", "true").load(filepath)
Is there a way I can modify my code to handle the backslash (\) coming in the data? Thanks in advance.
The purpose of the "quote" option is to specify a quote character, which wraps entire column values. Not sure if that is needed here, but you can use the regexp_replace function to remove specific characters (just select everything else as-is and modify the name column this way).
from pyspark.sql.functions import *
df = spark.read.option("inferSchema", "true").option("header", "true").csv(filepath)
df2 = df.select(col("Id"), col("dept"), col("city"), regexp_replace(col("name"), "\\\\", "").alias("name"), col("country"), col("state"))
df2.show(4, False)
Output:
+----+----+------+---------+-------+-----+
|Id |dept|city |name |country|state|
+----+----+------+---------+-------+-----+
|1234|ABC |dallas|markhenry|USA |texas|
+----+----+------+---------+-------+-----+
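Since the end goal is loading the cleaned data into PostgreSQL, here is a rough sketch of writing df2 out over JDBC; the connection URL, table name, and credentials below are placeholders, and it assumes the PostgreSQL JDBC driver is on the Spark classpath:
(df2.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")  # placeholder host/database
    .option("dbtable", "public.mytable")                     # placeholder table name
    .option("user", "myuser")                                # placeholder credentials
    .option("password", "mypassword")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save())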

Pairing two Pandas data frames with an ID value

I am trying to put together a usable set of data about glaciers. Our original data comes from an ArcGIS dataset, and the latitude/longitude values were stored in a separate file, now detached from the CSV with all of our data. I am attempting to merge the latitude/longitude file with our data set. Here's a preview of what the files look like.
This is my main dataset file, glims (columns dropped for clarity)
| ANLYS_ID | GLAC_ID | AREA |
|----------|----------------|-------|
| 101215 | G286929E46788S | 2.401 |
| 101146 | G286929E46788S | 1.318 |
| 101162 | G286929E46788S | 0.061 |
This is the latitude-longitude file, coordinates
| lat | long | glacier_id |
|-------|---------|----------------|
| 1.187 | -70.166 | G001187E70166S |
| 2.050 | -70.629 | G002050E70629S |
| 3.299 | -54.407 | G002939E70509S |
The problem is, the coordinates data frame has one row for each glacier id with latitude longitude, whereas my glims data frame has multiple rows for each glacier id with varying data for each entry.
I need every single entry in my main data file to have a latitude-longitude value added to it, based on the matching glacier_id between the two data frames.
Here's what I've tried so far.
glims = pd.read_csv('glims_clean.csv')
coordinates = pd.read_csv('LatLong_GLIMS.csv')
df['que'] = np.where((coordinates['glacier_id'] == glims['GLAC_ID']))
error returns: 'int' object is not subscriptable
and:
glims.merge(coordinates, how='right', on=('glacier_id', 'GLAC_ID'))
error returns: 'int' object has no attribute 'merge'
I have no idea how to tackle this big of a merge. I am also afraid of making mistakes because it is nearly impossible to catch them, since the data carries no other identifying factors.
Any guidance would be awesome, thank you.
This should work
glims = glims.merge(coordinates, how='left', left_on='GLAC_ID', right_on='glacier_id')
This is a classic merging problem. One way to solve it is using straight loc and index matching:
glims = glims.set_index('GLAC_ID')
glims.loc[:, 'lat'] = coord.set_index('glacier_id').lat
glims.loc[:, 'long'] = coord.set_index('glacier_id').long
glims = glims.reset_index()
You can also use pd.merge
pd.merge(glims,
coord.rename(columns={'glacier_id': 'GLAC_ID'}),
on='GLAC_ID')
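To guard against the silent mismatches the question worries about, merge can also report which rows actually found a coordinate; a small sketch using the same frames and column names as above:
merged = glims.merge(coordinates, how='left',
                     left_on='GLAC_ID', right_on='glacier_id',
                     indicator=True)  # adds a '_merge' column

# rows whose GLAC_ID had no match in the coordinates file
missing = merged[merged['_merge'] == 'left_only']
print(merged['_merge'].value_counts())
print(missing['GLAC_ID'].unique())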

How to split a csv file row to columns in python?

Sorry, I am new to Python. I have a CSV file that gets data from Google Trends written to it. However, the output is all written to the same column. I want the date in column A, Bitcoin in column B, Cryptocurrency in column C, and so on. I am really struggling with this simple task. Can anyone help, please? Thanks.
Below is the sample of the csv file.
"date Bitcoin Cryptocurrency Crypto isPartial"
"2013-10-27 5 0 0 False"
"2013-11-03 5 0 0 False"
"2013-11-10 5 0 0 False"
"2013-11-17 12 0 0 False"
"2013-11-24 14 0 0 False"
"2013-12-01 13 0 0 False"
This is my code to generate the file
#login
pytrend = TrendReq(google_username,google_password)
pytrend = TrendReq()
#Payload
pytrend.build_payload(kw_list=['Bitcoin','Cryptocurrency','Crypto'])
#interest over time
interest_over_time_df = pytrend.interest_over_time()
df = pd.DataFrame(interest_over_time_df)
file_name = "/Users/username/Desktop/Bitcoin.csv"
df.to_csv(file_name, sep='\t')
Here you go. You will need pandas to load it into a DataFrame.
import pandas as pd
dataframe = pd.read_csv('Bitcoin.csv', delimiter=r"\s+")
dataframe
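Since the file was written with sep='\t', another option (a sketch, assuming the first column is the date index that pytrends writes out) is to read it back with the same separator:
import pandas as pd

df = pd.read_csv('Bitcoin.csv', sep='\t', index_col='date', parse_dates=True)
print(df.head())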
First of all, take a look at the csv module documentation for Python; it should give you all the info and examples you need.
I understand you want to write your rows as CSV separated by tabs, so something like this should work for you:
import csv

# open an output file (example name) and create a csv.writer that separates fields with tabs
with open('output.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='\t')
    # write a row as a list into the csv.writer
    spamwriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])
I was able to find a few ideas from other posts. Option 1 simply uses string formatting to make the output look nice, while Option 2 uses PrettyTable to give a nicely formatted answer. You can find the PrettyTable documentation here.
Option 1 comes from a previous post. All you have to do is play around with the numbers so that the spacing looks good enough to make you happy, and of course change the file name to match your CSV file.
Option 1
You could use format to left-justify your output. For example:
import csv

with open("contactlist.csv") as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        print('{:<15} {:<15} {:<20} {:<25}'.format(*row))
Output:
Name            Phone           Company              Email
Elon Musk       454-6723        SpaceX               emusk#spacex.com
Larry Page      853-0653        Google               lpage#gmail.com
Tim Cook        133-0419        Apple                tcook#apple.com
Steve Ballmer   456-7893        Developers!          sballmer#bluescreen.com
You can read more about format here. The < symbol left-aligns the text, and the number specifies the width of the string. Each {} can include a positional argument before the colon : - if they are omitted, the strings will appear in the order of the arguments in the unpacked list row.
Option 2
For Option 2, I was able to find this information here: Python Pretty Table.
That page gives you a multitude of ways to solve this problem, including a very simple one using the from_csv() function, which can be imported from PrettyTable with from prettytable import from_csv. Look at the example below for better insight.
Example:
Data.csv
"City name", "Area", "Population", "Annual Rainfall"
"Adelaide", 1295, 1158259, 600.5
"Brisbane", 5905, 1857594, 1146.4
"Darwin", 112, 120900, 1714.7
"Hobart", 1357, 205556, 619.5
"Sydney", 2058, 4336374, 1214.8
"Melbourne", 1566, 3806092, 646.9
"Perth", 5386, 1554769, 869.4
Python Code:
#!/usr/bin/python3
from prettytable import from_csv
with open("data.csv", "r") as fp:
    x = from_csv(fp)
print(x)
Output will look something like the following:
+-----------+------+------------+-----------------+
| City name | Area | Population | Annual Rainfall |
+-----------+------+------------+-----------------+
| Adelaide | 1295 | 1158259 | 600.5 |
| Brisbane | 5905 | 1857594 | 1146.4 |
| Darwin | 112 | 120900 | 1714.7 |
| Hobart | 1357 | 205556 | 619.5 |
| Sydney | 2058 | 4336374 | 1214.8 |
| Melbourne | 1566 | 3806092 | 646.9 |
| Perth | 5386 | 1554769 | 869.4 |
+-----------+------+------------+-----------------+
Please let me know if this was beneficial by leaving a comment or casting a vote, thank you!

Writing values to excel in python using pandas

I'm new to Python and would like to pass the ZipCode values in an Excel file to the 'uszipcode' package and write the state for each particular zipcode to the 'OriginalState' column in the Excel sheet. The reason for doing this is that I want to compare the existing states with the original states. I don't understand whether the for loop in the code is wrong or something else is. Currently, I cannot write the states to the OriginalState column in Excel. The code I've written is:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import uszipcode as US
from uszipcode import ZipcodeSearchEngine
search = ZipcodeSearchEngine()
df = pd.read_excel("H:\excel\checking for zip and states\checkZipStates.xlsx", sheet_name='Sheet1')
#print(df.values)
for i, row in df.iterrows():
    zipcode = search.by_zipcode(row['ZipCode'])  # for searching zipcode
    b = zipcode.State
    df.at['row', 'OriginalState'] = b

df.to_excel("H:\\excel\\checking for zip and states\\new.xlsx", sheet_name="compare", index=False)
The excel sheet is in this format:
| ZipCode |CurrentState | OriginalState |
|-----------|-----------------|---------------|
| 59714 | Montana | |
| 29620 | South Carolina | |
| 54405 | Wisconsin | |
| . | . | |
| . | . | |
You can add the OriginalState column without iterating the df:
Define a function that returns the value you want for any given zip code:
def get_original_state(state):
    zipcode = search.by_zipcode(state)  # for searching zipcode
    return zipcode.State
Then:
df['OriginalState'] = df.apply(lambda row: get_original_state(row['ZipCode']), axis=1)
Finally, export the df to Excel only once.
This should do the trick.
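Equivalently, since only the ZipCode column is needed for the lookup, Series.apply keeps it a bit shorter (a sketch under the same assumption that search.by_zipcode(...).State returns the state name):
df['OriginalState'] = df['ZipCode'].apply(get_original_state)
df.to_excel("H:\\excel\\checking for zip and states\\new.xlsx",
            sheet_name="compare", index=False)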
