How to split a csv file row to columns in python? - python

Sorry, I am new to Python. I have a CSV file that gets its data from Google Trends. However, the output is all written to the same column. I want the date in column A, Bitcoin in column B, Cryptocurrency in column C, and so on. I am really struggling with this simple task. Can anyone help, please? Thanks.
Below is the sample of the csv file.
"date Bitcoin Cryptocurrency Crypto isPartial"
"2013-10-27 5 0 0 False"
"2013-11-03 5 0 0 False"
"2013-11-10 5 0 0 False"
"2013-11-17 12 0 0 False"
"2013-11-24 14 0 0 False"
"2013-12-01 13 0 0 False"
This is my code to generate the file
import pandas as pd
from pytrends.request import TrendReq

# login
pytrend = TrendReq(google_username, google_password)
pytrend = TrendReq()
# payload
pytrend.build_payload(kw_list=['Bitcoin', 'Cryptocurrency', 'Crypto'])
# interest over time
interest_over_time_df = pytrend.interest_over_time()
df = pd.DataFrame(interest_over_time_df)
file_name = "/Users/username/Desktop/Bitcoin.csv"
df.to_csv(file_name, sep='\t')

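Here you go. You will need pandas to load the file into a DataFrame. The regular expression \s+ splits each row on any run of whitespace, which includes the tabs written by to_csv(sep='\t').
import pandas as pd
dataframe = pd.read_csv('Bitcoin.csv', delimiter=r"\s+")
print(dataframe)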
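If pandas is not available, a similar split can be sketched with the standard csv module. This is only a minimal sketch, assuming the file is tab-separated as produced by to_csv(sep='\t'); the helper name tsv_to_csv and the in-memory sample are illustrative and not part of the original code.

```python
import csv
import io

def tsv_to_csv(tsv_text):
    """Convert tab-separated text to comma-separated text (hypothetical helper)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)  # each row is already a list of fields
    return out.getvalue()

# Sample rows modelled on the question's data, with tabs between the fields
sample = "date\tBitcoin\tCryptocurrency\tCrypto\tisPartial\n2013-10-27\t5\t0\t0\tFalse\n"
print(tsv_to_csv(sample))
```

A spreadsheet program opening the comma-separated result should then place each field in its own column.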

First of all, take a look at the Python documentation for the csv module; it should give you all the information and examples you need.
As I understand it, you want to write your rows as tab-separated values, so something like this should work for you:
import csv

# First create a csv.writer on an open file (the file name is only an example)
with open('output.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='\t')
    # Write each row to the csv.writer as a list
    spamwriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])

I was able to find a few ideas in other posts. Option 1 simply uses string formatting to make the output look nice, while Option 2 uses PrettyTable to produce a nicely formatted table. You can find the PrettyTable documentation here.
Option 1 comes from a previous post. All you have to do is adjust the numbers until the spacing looks good to you, and change the file name to match your CSV file.
Option 1
You could use format to left justify your output. For example,
import csv

with open("contactlist.csv", newline="") as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        print('{:<15} {:<15} {:<20} {:<25}'.format(*row))
Output:
Name            Phone           Company              Email
Elon Musk       454-6723        SpaceX               emusk@spacex.com
Larry Page      853-0653        Google               lpage@gmail.com
Tim Cook        133-0419        Apple                tcook@apple.com
Steve Ballmer   456-7893        Developers!          sballmer@bluescreen.com
You can read more about format here. The < symbol left-aligns the text, and the number specifies the width of the string. Each {} can include a positional argument before the colon : - if they are omitted, the strings will appear in the order of the arguments in the unpacked list row.
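The alignment and positional-argument rules described above can be tried on a single row. This is a small illustrative sketch; the row values are borrowed from the sample output, and the field widths are arbitrary.

```python
row = ["Elon Musk", "454-6723", "SpaceX"]

# '<' left-aligns each value; the number after it is the minimum field width
left = "{:<12}|{:<10}|{:<8}|".format(*row)

# An index before the colon picks the argument explicitly, so fields can be reordered
reordered = "{2:<8}|{0:<12}|".format(*row)

print(left)       # Elon Musk   |454-6723  |SpaceX  |
print(reordered)  # SpaceX  |Elon Musk   |
```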
Option 2
I found the information for Option 2 here: Python Pretty Table.
That page gives you a multitude of ways to solve this problem, including a very simple one: the from_csv() function, which you can import with from prettytable import from_csv. Look at the example below for better insight.
Example:
data.csv
"City name", "Area", "Population", "Annual Rainfall"
"Adelaide", 1295, 1158259, 600.5
"Brisbane", 5905, 1857594, 1146.4
"Darwin", 112, 120900, 1714.7
"Hobart", 1357, 205556, 619.5
"Sydney", 2058, 4336374, 1214.8
"Melbourne", 1566, 3806092, 646.9
"Perth", 5386, 1554769, 869.4
Python Code:
#!/usr/bin/python3
from prettytable import from_csv

with open("data.csv", "r") as fp:
    x = from_csv(fp)
print(x)
Output will look something like the following:
+-----------+------+------------+-----------------+
| City name | Area | Population | Annual Rainfall |
+-----------+------+------------+-----------------+
|  Adelaide | 1295 |  1158259   |      600.5      |
|  Brisbane | 5905 |  1857594   |      1146.4     |
|   Darwin  | 112  |   120900   |      1714.7     |
|   Hobart  | 1357 |   205556   |      619.5      |
|   Sydney  | 2058 |  4336374   |      1214.8     |
| Melbourne | 1566 |  3806092   |      646.9      |
|   Perth   | 5386 |  1554769   |      869.4      |
+-----------+------+------------+-----------------+
Please let me know if this was beneficial by leaving a comment or casting a vote, thank you!

Related

CSV file to an array to a table? (Python 3.10.4)

I'm new to Python and doing some project based learning.
I have a CSV file that I've put into an array, but I'd like to present it in a PrettyTable.
Here's what I have so far:
import csv
import numpy as np

with open('destiny.csv', 'r') as f:
    data = list(csv.reader(f, delimiter=";"))
data = np.array(data)
Output is this:
['Loud Lullaby,Aggressive,Moon,Kinetic,120,Legendary,hand_cannon']
['Pribina-D,Aggressive,Gunsmith,Kinetic,120,Legendary,hand_cannon']
['True Prophecy,Aggressive,World,Kinetic,120,Legendary,hand_cannon']
['Igneous Hammer,Aggressive,Trials,Solar,120,Legendary,hand_cannon']
But I'd like to get it into this:
from prettytable import PrettyTable
myTable = PrettyTable(['Gun Name', 'Archetype', 'Source', 'Element', 'Rounds Per Minute', 'Rarity', 'Weapon Type'])
myTable.add_row(['Loud Lullaby', 'Aggressive', 'Moon', 'Kinetic', '120', 'Legendary', 'Hand Cannon'])
myTable.add_row(["Pribina-D", "Aggressive", "Gunsmith", "Kinetic", "120", "Legendary", "Hand Cannon"])
myTable.add_row(["True Prophecy", "Aggressive", "World", "Kinetic", "120", "Legendary", "Hand Cannon"])
myTable.add_row(["Igneous Hammer", "Aggressive", "Trials", "Solar", "120", "Legendary", "Hand Cannon"])
So it can look like this:
+---------------------------------+--------------+---------------+---------+-------------------+-----------+-------------+
|            Gun Name             |  Archetype   |    Source     | Element | Rounds Per Minute |  Rarity   | Weapon Type |
+---------------------------------+--------------+---------------+---------+-------------------+-----------+-------------+
|          Loud Lullaby           |  Aggressive  |     Moon      | Kinetic |        120        | Legendary | Hand Cannon |
|            Pribina-D            |  Aggressive  |   Gunsmith    | Kinetic |        120        | Legendary | Hand Cannon |
|          True Prophecy          |  Aggressive  |     World     | Kinetic |        120        | Legendary | Hand Cannon |
|         Igneous Hammer          |  Aggressive  |    Trials     |  Solar  |        120        | Legendary | Hand Cannon |
+---------------------------------+--------------+---------------+---------+-------------------+-----------+-------------+
Thoughts on the best way to get the data set incorporated into the table without having to copy and paste every line into myTable.add_row? Because there's hundreds of lines...
[Credit to vishwasrao99 at Kaggle for this CSV file]
I just combined your two pieces of script:
import csv
import numpy as np
from prettytable import PrettyTable

with open('destiny.csv', 'r') as f:
    data = list(csv.reader(f, delimiter=";"))
data = np.array(data)

columns = ['Gun Name', 'Archetype', 'Source', 'Element', 'Rounds Per Minute', 'Rarity', 'Weapon Type']
myTable = PrettyTable(columns)
for row in data:
    # Each row holds one comma-separated string; avoid naming the result `list`,
    # which would shadow the built-in
    fields = row[0].split(",")
    myTable.add_row(fields)
print(myTable)
Note that I used split(",") to split the strings you get in your numpy array at every comma, creating identical lists as what you feed in manually in your example.
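The single-string rows appear because a comma-separated file is being read with delimiter=";". As a minimal sketch, assuming the file really is comma-separated (as the printed output suggests), the default delimiter already returns each row as a list of fields, so no manual split is needed. The in-memory sample below stands in for destiny.csv.

```python
import csv
import io

# Two lines standing in for destiny.csv (taken from the question's output)
sample = (
    "Loud Lullaby,Aggressive,Moon,Kinetic,120,Legendary,hand_cannon\n"
    "Pribina-D,Aggressive,Gunsmith,Kinetic,120,Legendary,hand_cannon\n"
)

# Wrong delimiter: each row comes back as a list holding one long string
rows_semicolon = list(csv.reader(io.StringIO(sample), delimiter=";"))

# Default delimiter (comma): each row is already split into seven fields
rows_comma = list(csv.reader(io.StringIO(sample)))

print(len(rows_semicolon[0]))  # 1
print(len(rows_comma[0]))      # 7
```

Each list in rows_comma could then be passed to myTable.add_row() directly.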

How to extract desired sections from a JSON string

I want to clean up my data so that I can understand it and sift through it more easily. So far I have been able to download a public Google Sheets document and save it as a CSV file. But when I print the data it is quite messy and hard to understand. The data came from a website, and in the browser's developer tools (inspect mode) I can see that it is neatly organized.
Like this:
[Image: website data in the browser's inspect view]
But when I print it in a Jupyter notebook it looks messy, like this:
b'/O_o/\ngoogle.visualization.Query.setResponse({"version":"0.6","reqId":"0output=csv","status":"ok","sig":"1241529276","table":{"cols":[{"id":"A","label":"Entity","type":"string"},{"id":"B","label":"Week","type":"number","pattern":"General"},{"id":"C","label":"Day","type":"date","pattern":"yyyy-mm-dd"},{"id":"D","label":"Flights
2019
(Reference)","type":"number","pattern":"General"},{"id":"E","label":"Flights","type":"number","pattern":"General"},{"id":"F","label":"%
vs 2019
(Daily)","type":"number","pattern":"General"},{"id":"G","label":"Flights
(7-day moving
average)","type":"number","pattern":"General"},{"id":"H","label":"% vs
2019 (7-day Moving
Average)","type":"number","pattern":"General"},{"id":"I","label":"Day
2019","type":"date","pattern":"yyyy-mm-dd"},{"id":"J","label":"Day
Previous
Year","type":"date","pattern":"yyyy-mm-dd"},{"id":"K","label":"Flights
Previous
Year","type":"number","pattern":"General"}],"rows":[{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,2)","f":"2020-09-02"},{"v":92.0,"f":"92"},{"v":59.0,"f":"59"},{"v":-0.358695652173913,"f":"-0,3586956522"},{"v":70.0,"f":"70"},{"v":-0.300998573466476,"f":"-0,3009985735"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":92.0,"f":"92"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,3)","f":"2020-09-03"},{"v":96.0,"f":"96"},{"v":67.0,"f":"67"},{"v":-0.302083333333333,"f":"-0,3020833333"},
Is there a pandas way to clean this data up?
Essentially, I am trying to extract three variables from the data: country, date, and a number.
Here you can see how the data starts out with the key "rows":
[Image: Jupyter output showing where "rows" begins]
Each entry gives a country, a date, and then a series of associated numbers.
What I want to get is the country name, a specific date, and a specific number.
For example, here is one entry; this structure is repeated throughout the data:
{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},
From this section of the data I only want the country name (Albania), the date ("2020-09-01"), and the number -0.5038.
Here is the code I used to grab the google spreadsheet data and save it as a csv:
import requests
import pandas as pd
r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=csv')
data = r.content
print(data)
Please any and all advice would be amazing.
Thank you
I'm not sure how you arrived at this csv file, but the easiest way would be to get the json directly with requests, load it as a dict and process it. Nonetheless a solution for the current file would be:
import requests
import pandas as pd
import json
r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=jspn')
data = r.content
data = json.loads(data.decode('utf-8').split("(", 1)[1].rsplit(")", 1)[0]) # clean up the string so only the json data is left
d = [[i['c'][0]['v'], i['c'][2]['f'], i['c'][5]['v']] for i in data['table']['rows']]
df = pd.DataFrame(d, columns=['country', 'date', 'number'])
Output:
| | country | date | number |
|---:|:----------|:-----------|--------------:|
| 0 | Albania | 2020-09-01 | -0.503876 |
| 1 | Albania | 2020-09-02 | -0.358696 |
| 2 | Albania | 2020-09-03 | -0.302083 |
| 3 | Albania | 2020-09-04 | -0.135922 |
| 4 | Albania | 2020-09-05 | -0.43617 |
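The clean-up step can also be tried offline on a shortened copy of the response. This is a minimal sketch using only the standard library; the helper name extract_rows is hypothetical, and the sample bytes are abbreviated from the output printed in the question (one row, first six cells only).

```python
import json

def extract_rows(raw_bytes):
    """Return (country, date, number) tuples from a gviz-style response (hypothetical helper)."""
    text = raw_bytes.decode("utf-8")
    # Keep only what sits between the first "(" and the last ")": the JSON payload
    payload = text.split("(", 1)[1].rsplit(")", 1)[0]
    table = json.loads(payload)["table"]
    # Cell 0 is the country, cell 2 the formatted date, cell 5 the daily % vs 2019
    return [(r["c"][0]["v"], r["c"][2]["f"], r["c"][5]["v"]) for r in table["rows"]]

# Shortened sample modelled on the response printed in the question
sample = (
    b'/*O_o*/\ngoogle.visualization.Query.setResponse({"table":{"rows":[{"c":['
    b'{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},'
    b'{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},'
    b'{"v":-0.503875968992248,"f":"-0,503875969"}]}]}});'
)
print(extract_rows(sample))
```

Splitting on the first "(" and the last ")" matters here because the payload itself contains parentheses in values such as Date(2020,8,1).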

Pyspark : How to escape backslash ( \ ) in input file

I am loading a csv file into postgresql using pyspark. I have a record in the input file which looks like below -
Id,dept,city,name,country,state
1234,ABC,dallas,markhenry\,USA,texas
When I load it into the postgresql database then it gets loaded like this which is not correct -
Id | dept| city | name | country | state
1234 | ABC | dallas | markhenry,USA | texas | null
correct output in postgresdb should be -
Id | dept| city | name | country | state
1234 | ABC | dallas | markhenry | USA | texas
I am reading the file like below -
input_df = spark.read.format("csv").option("quote", "\"").option("escape", "\"") \
    .option("header", "true").load(filepath)
Is there a way I can modify my code to handle the backslash (\) coming in the data? Thanks in advance.
The purpose of the "quote" option is to specify a quote character, which wraps entire column values. Not sure if that is needed here, but you can use the regexp_replace function to remove specific characters (just select everything else as-is and modify the name column this way).
from pyspark.sql.functions import col, regexp_replace
df = spark.read.option("inferSchema", "true").option("header", "true").csv(filepath)
df2 = df.select(col("Id"), col("dept"), col("city"), regexp_replace(col("name"), "\\\\", "").alias("name"), col("country"), col("state"))
df2.show(4, False)
Output:
+----+----+------+---------+-------+-----+
|Id  |dept|city  |name     |country|state|
+----+----+------+---------+-------+-----+
|1234|ABC |dallas|markhenry|USA    |texas|
+----+----+------+---------+-------+-----+
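As a plain-Python analogy (not Spark), the standard csv module shows the same kind of effect: when the backslash is treated as an escape character, it swallows the comma that follows it and two fields merge. This sketch only illustrates the mechanism on the sample row from the question; it is not a statement about how Spark's CSV reader is implemented.

```python
import csv
import io

line = "1234,ABC,dallas,markhenry\\,USA,texas\n"

# Backslash treated as an escape character: the comma after it becomes data
merged = next(csv.reader(io.StringIO(line), escapechar="\\"))

# No escape character: the backslash stays in the field as a literal
literal = next(csv.reader(io.StringIO(line)))

# Strip the stray backslash afterwards, as regexp_replace does in the answer
cleaned = [field.replace("\\", "") for field in literal]

print(merged)   # five fields, with 'markhenry,USA' merged
print(cleaned)  # six fields, with the backslash removed
```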

Pandas not displaying all columns when writing to

I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However the actual CSV file i am getting is like so:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
Using this code (runnable by calling createCSV(); it pulls data from the COVID govt GitHub):
import csv  # csv reader
import pandas as pd  # csv parser
import collections  # not needed
import requests  # retrieves URL from gov data

def getFile():
    url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
    response = requests.get(url)
    print('Writing file...')
    open('us_deaths.csv', 'wb').write(response.content)

# takes raw data from link. creates CSV for each unique state and removes unneeded headings
def createCSV():
    getFile()
    # init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    # drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    # data.drop(['Admin2'], axis=1, inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    # data.drop(['Province_State'], axis=1, inplace=True)
    data.to_csv('DEBUGDATA2.csv')
    # sets province_state as primary key. Searches based on date and key to create new CSVs in root directory of python app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to convert the date columns (everything after the first two) to dates, so that I can select only those from 03/23/2020 onward. If anyone has a better method of doing this, I would love to know.
To check that it works, DEBUGDATA2.csv contains all the field names, including Admin2 (county name), Province_State, and the dates.
However, as you can see, Admin2 has disappeared from my final CSV. I am not sure how to make this work; if anyone has any ideas, that would be great!
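I changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
A multi-level index was needed for the Admin2 column to show. With only Province_State in the index, Admin2 was an ordinary column, so the later column selections (the iloc slice and the date-range selection) dropped it; index levels are kept and written out by to_csv. Any smoother tips on the date-range section are welcome.
Thanks for the help, all!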

string manipulation, data wrangling, regex

I have a .txt file of 3 million rows. The file contains data that looks like this:
# RSYNC: 0 1 1 0 512 0
#$SOA 5m localhost. hostmaster.localhost. 1906022338 1h 10m 5d 1s
# random_number_ofspaces_before_this text $TTL 60s
#more random information
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com
.0-1-hub.com
.zzzy1129.cn
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
.005verf-desj.com
.01accesfunds.com
In the above data, there is a code associated with all domains listed beneath it.
I want to turn the above data into a format that can be loaded into a HiveQL/SQL. The HiveQL table should look like:
+--------------------+--------------+-------------+-----------------------------------------------------+
| domain_name | period_count | parsed_code | raw_code |
+--------------------+--------------+-------------+-----------------------------------------------------+
| test | 0 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-0m5tk.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-1-hub.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .zzzy1129.cn | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-il.ml | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .005verf-desj.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .01accesfunds.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
+--------------------+--------------+-------------+-----------------------------------------------------+
Please note that I do not want the vertical bars in any output. They are only there to make the above look like a table.
I'm guessing that creating a HiveQL table like the above will involve converting the .txt into a .csv or a Pandas data frame. If creating a .csv, then the .csv would probably look like:
domain_name,period_count,parsed_code,raw_code
test,0,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-1-hub.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.zzzy1129.cn,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-il.ml,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.005verf-desj.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.01accesfunds.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
I'd be interested in a Python solution, but lack familiarity with the packages and functions necessary to complete the above data wrangling steps. I'm looking for a complete solution, or code tidbits to construct my own solution. I'm guessing regular expressions will be needed to identify the "category" or "code" line in the raw data. They always start with ":127.0.1." I'd also like to parse the code out to create a parsed_code column, and a period_count column that counts the number of periods in the domain_name string. For testing purposes, please create a .txt of the sample data I have provided at the beginning of this post.
Regardless of how you want to format the result in the end, I suppose the first step is to separate the domain_name and the code. That part is pure Python:
rows = []
code = None
parsed_code = None
with open('input.txt', 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        if line.startswith(':127'):
            # A code line: remember it for the domains that follow
            code = line
            parsed_code = line.split(':')[1]
            continue
        if line.startswith('#'):
            continue  # skip comment lines
        period_count = line.count('.')
        rows.append((line, period_count, parsed_code, code))
Just for illustration, you can use pandas to format the data nicely as a table, which might help if you want to pipe this to SQL, but it's not absolutely necessary. Post-processing of strings is also quite straightforward in pandas.
import pandas as pd
df = pd.DataFrame(rows, columns=['domain_name', 'period_count', 'parsed_code', 'raw_code'])
print(df)
prints this:
         domain_name  period_count  parsed_code                                           raw_code
0               test             0    127.0.1.2  :127.0.1.2:https://www.spamhaus.org/query/doma...
1       .0-0m5tk.com             2    127.0.1.2  :127.0.1.2:https://www.spamhaus.org/query/doma...
2       .0-1-hub.com             2    127.0.1.2  :127.0.1.2:https://www.spamhaus.org/query/doma...
3       .zzzy1129.cn             2    127.0.1.2  :127.0.1.2:https://www.spamhaus.org/query/doma...
4           .0-il.ml             2    127.0.1.4  :127.0.1.4:https://www.spamhaus.org/query/doma...
5  .005verf-desj.com             2    127.0.1.4  :127.0.1.4:https://www.spamhaus.org/query/doma...
6  .01accesfunds.com             2    127.0.1.4  :127.0.1.4:https://www.spamhaus.org/query/doma...
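Since the question asks for a .csv, the same parsing logic can be sketched as a generator that feeds csv.writer. This is a minimal sketch using only the standard library: the function name parse_blocklist and the shortened in-memory sample are illustrative, and skipping blank lines is an assumption about the real file.

```python
import csv
import io

def parse_blocklist(lines):
    """Yield (domain_name, period_count, parsed_code, raw_code) tuples (hypothetical helper)."""
    code = parsed_code = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith(":127.0.1."):
            # A code line applies to every domain listed beneath it
            code = line
            parsed_code = line.split(":")[1]
            continue
        yield (line, line.count("."), parsed_code, code)

# Shortened sample based on the data in the question
sample = """# RSYNC: 0 1 1 0 512 0
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
.0-0m5tk.com
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
"""

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["domain_name", "period_count", "parsed_code", "raw_code"])
writer.writerows(parse_blocklist(sample.splitlines()))
print(out.getvalue())
```

For the real 3-million-row file, passing the open file object (parse_blocklist(f)) and writing to a file opened with newline="" would process the data line by line instead of holding it all in memory.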
You can do all of this with the Python standard library.
HEADER = "domain_name | code"

# Open files
with open("input.txt") as f_in, open("output.txt", "w") as f_out:
    # Write header
    print(HEADER, file=f_out)
    print("-" * len(HEADER), file=f_out)

    # Parse file and output in correct format
    code = None
    for line in f_in:
        # Drop the trailing newline so that endswith("$") can match
        line = line.strip()
        if not line or line.startswith("#"):
            # Ignore blank lines and comments
            continue
        if line.endswith("$"):
            # Store line as the current "code"
            code = line
        else:
            # Write these domain_name entries into the
            # output file separated by ' | '
            print(line, code, sep=" | ", file=f_out)
