CSV file to an array to a table? (Python 3.10.4)

I'm new to Python and doing some project based learning.
I have a CSV file that I've put into an array, but I'd like to present it with PrettyTable.
Here's what I have so far:
import csv
import numpy as np

with open('destiny.csv', 'r') as f:
    data = list(csv.reader(f, delimiter=";"))
data = np.array(data)
Output is this:
['Loud Lullaby,Aggressive,Moon,Kinetic,120,Legendary,hand_cannon']
['Pribina-D,Aggressive,Gunsmith,Kinetic,120,Legendary,hand_cannon']
['True Prophecy,Aggressive,World,Kinetic,120,Legendary,hand_cannon']
['Igneous Hammer,Aggressive,Trials,Solar,120,Legendary,hand_cannon']
But I'd like to get it into this:
from prettytable import PrettyTable
myTable = PrettyTable(['Gun Name', 'Archetype', 'Source', 'Element', 'Rounds Per Minute', 'Rarity', 'Weapon Type'])
myTable.add_row(['Loud Lullaby', 'Aggressive', 'Moon', 'Kinetic', '120', 'Legendary', 'Hand Cannon'])
myTable.add_row(["Pribina-D", "Aggressive", "Gunsmith", "Kinetic", "120", "Legendary", "Hand Cannon"])
myTable.add_row(["True Prophecy", "Aggressive", "World", "Kinetic", "120", "Legendary", "Hand Cannon"])
myTable.add_row(["Igneous Hammer", "Aggressive", "Trials", "Solar", "120", "Legendary", "Hand Cannon"])
So it can look like this:
+----------------+------------+----------+---------+-------------------+-----------+-------------+
|    Gun Name    | Archetype  |  Source  | Element | Rounds Per Minute |  Rarity   | Weapon Type |
+----------------+------------+----------+---------+-------------------+-----------+-------------+
|  Loud Lullaby  | Aggressive |   Moon   | Kinetic |        120        | Legendary | Hand Cannon |
|   Pribina-D    | Aggressive | Gunsmith | Kinetic |        120        | Legendary | Hand Cannon |
| True Prophecy  | Aggressive |  World   | Kinetic |        120        | Legendary | Hand Cannon |
| Igneous Hammer | Aggressive |  Trials  |  Solar  |        120        | Legendary | Hand Cannon |
+----------------+------------+----------+---------+-------------------+-----------+-------------+
Thoughts on the best way to get the data set incorporated into the table without having to copy and paste every line into myTable.add_row? Because there's hundreds of lines...
[Credit to vishwasrao99 at Kaggle for this CSV file]

I just combined your two pieces of script:
import csv
import numpy as np
from prettytable import PrettyTable

with open('destiny.csv', 'r') as f:
    data = list(csv.reader(f, delimiter=";"))
data = np.array(data)

columns = ['Gun Name', 'Archetype', 'Source', 'Element', 'Rounds Per Minute', 'Rarity', 'Weapon Type']
myTable = PrettyTable(columns)

for row in data:
    values = row[0].split(",")  # avoid shadowing the built-in name `list`
    myTable.add_row(values)

print(myTable)
Note that I used split(",") to split the strings in your numpy array at every comma, creating lists identical to the ones you feed in manually in your example.
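As a side note, here is a minimal sketch (assuming destiny.csv is actually comma-separated and has no header row): if you pass delimiter="," to csv.reader, each row already arrives as a list of fields, so you can drop numpy and the split() call entirely.
import csv
from prettytable import PrettyTable

columns = ['Gun Name', 'Archetype', 'Source', 'Element', 'Rounds Per Minute', 'Rarity', 'Weapon Type']
myTable = PrettyTable(columns)

with open('destiny.csv', newline='') as f:
    for row in csv.reader(f, delimiter=","):  # each row is already a list of fields
        myTable.add_row(row)

print(myTable)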

Related

How to lookup data from one CSV in another CSV?

In the crq_data file I have cities and states from a user-uploaded *.csv file.
In the cityCoordinates.csv file I have a library of American cities and states along with their coordinates. I would like this to be a sort of "lookup tool" that compares the uploaded .csv file against the library to find coordinates to map in Folium.
Right now it reads line by line, appending the coordinates one at a time (n seconds); I would like it to run much faster so that if there are 6000 lines the user doesn't have to wait for 6000 seconds.
Here is part of my code:
crq_file = askopenfilename(filetypes=[('CSV Files', '*csv')])
crq_data = pd.read_csv(crq_file, encoding="utf8")
coords = pd.read_csv("cityCoordinates.csv")

for crq in range(len(crq_data)):
    task_city = crq_data.iloc[crq]["TaskCity"]
    task_state = crq_data.iloc[crq]["TaskState"]
    for coordinates in range(len(coords)):
        cityCoord = coords.iloc[coordinates]["City"]
        stateCoord = coords.iloc[coordinates]["State"]
        latCoord = coords.iloc[coordinates]["Latitude"]
        lngCoord = coords.iloc[coordinates]["Longitude"]
        if task_city == cityCoord and task_state == stateCoord:
            crq_data["CRQ Latitude"] = latCoord
            crq_data["CRQ Longitude"] = lngCoord
            print(cityCoord, stateCoord, latCoord, lngCoord)
This is an example of the current Terminal Output
Example of uploaded .csv file
I see this not as a problem w/optimizing Pandas, but finding a good data structure for fast lookups: and a good data structure for fast lookups is the dict. The dict takes memory, though; you'll need to evaluate that cost for yourself.
I mocked up what your cityCoordinates CSV could look like:
| City | State | Latitude | Longitude |
|----------|-------|------------|-------------|
| Portland | OR | 45°31′12″N | 122°40′55″W |
| Dallas | TX | 32°46′45″N | 96°48′32″W |
| Portland | ME | 43°39′36″N | 70°15′18″W |
import csv
import pprint


def cs_key(city_name: str, state_name: str) -> str:
    """Make a normalized City-State key."""
    return city_name.strip().lower() + "--" + state_name.strip().lower()


# A dict of { "city_name--state_name": (latitude, longitude), ... }
coords_lookup = {}

with open("cityCoordinates.csv", newline="") as f:
    reader = csv.DictReader(f)  # your coords file appears to have a header
    for row in reader:
        city = row["City"]
        state = row["State"]
        lat = row["Latitude"]
        lon = row["Longitude"]

        key = cs_key(city, state)
        coords_lookup[key] = (lat, lon)

pprint.pprint(coords_lookup, sort_dicts=False)
When I run that, I get:
{'portland--or': ('45°31′12″N', '122°40′55″W'),
'dallas--tx': ('32°46′45″N', '96°48′32″W'),
'portland--me': ('43°39′36″N', '70°15′18″W')}
Now, iterating the task data looks pretty much the same: we take a pair of City and State, make a normalized key out of them, then try to look up that key for known coordinates.
I mocked up some task data:
| TaskCity | TaskState |
|------------|-----------|
| Portland | OR |
| Fort Worth | TX |
| Dallas | TX |
| Boston | MA |
| Portland | ME |
and when I run this:
with open("crq_data.csv", newline="") as f:
reader = csv.DictReader(f)
for row in reader:
city = row["TaskCity"]
state = row["TaskState"]
key = cs_key(city, state)
coords = coords_lookup.get(key, (None, None))
if coords != (None, None):
print(city, state, coords[0], coords[1])
I get:
Portland OR 45°31′12″N 122°40′55″W
Dallas TX 32°46′45″N 96°48′32″W
Portland ME 43°39′36″N 70°15′18″W
This solution is going to be much faster in principle because you're not doing a cityCoordinates-rows × taskData-rows quadratic loop. And in practice, Pandas suffers when doing row iteration; I'm not sure if the same holds for indexing (iloc), but in general Pandas is for manipulating columns of data and, I would say, is not for row-oriented problems/solutions.
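If you do want to stay in Pandas, a vectorized merge avoids the Python-level loop entirely. A rough sketch (column names taken from the question; the normalization done in cs_key is omitted here):
import pandas as pd

crq_data = pd.read_csv("crq_data.csv")
coords = pd.read_csv("cityCoordinates.csv")

# One vectorized left join instead of a nested row-by-row loop.
merged = crq_data.merge(
    coords,
    left_on=["TaskCity", "TaskState"],
    right_on=["City", "State"],
    how="left",
)
merged = merged.rename(columns={"Latitude": "CRQ Latitude", "Longitude": "CRQ Longitude"})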

Grouping CSV Rows By The Names of Users

I have a table in Python with the following data from a CSV:
| subscriberKey        | Name  | Job           |
| -------------------- | ----- | ------------- |
| 123@yahoo.com        | Brian | Computer Tech |
| example@gmail.com    | Brian | Sales         |
| someone@google.com   | Gabby | Sales         |
| testinge@sendesk.com | Gabby | Marketing     |
| sandbox@aol.com      | Tyler | Porter        |
I want to be able to group the data by the Name and have all of the other cells come with it.
It should end up looking like this.
| subscriberKey     | Name  | Job           |
| ----------------- | ----- | ------------- |
| 123@yahoo.com     | Brian | Computer Tech |
| example@gmail.com | Brian | Sales         |

| subscriberKey        | Name  | Job       |
| -------------------- | ----- | --------- |
| someone@google.com   | Gabby | Sales     |
| testinge@sendesk.com | Gabby | Marketing |

| subscriberKey   | Name  | Job    |
| --------------- | ----- | ------ |
| sandbox@aol.com | Tyler | Porter |
Furthermore, I want to create a new CSV file for every table that is created. I have tried to loop through it but have failed too many times, and I am currently back at the start with only the file printing as its normal table. Can anyone help?
import csv

f = open('work.csv')
csv_f = csv.reader(f)

for row in csv_f:
    print(row)
When you are trying to group records by a certain key (the name in this case), a hashmap is usually a good data structure to try.
As a general solution for future readers (a minimal sketch follows these steps):

1. Create an empty dictionary.
2. Choose the key that you want to group your data by.
3. Iterate over the data and parse out the key and its related items.
4. Add the related items to dict[key].

Now each key in dict will have a list of all the items related to it.
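In code, those steps look roughly like this (data, record, and the "name" field are placeholders for illustration, not the OP's actual variables):
from collections import defaultdict

groups = defaultdict(list)        # 1. empty dict (with a list default)
for record in data:               # 3. iterate over the data
    key = record["name"]          # 2. the grouping key
    groups[key].append(record)    # 4. collect related items under the key

# groups now maps each name to the list of records that share it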
Tailored more specifically to the OP's question:
import collections


def write_csv(name, lines):
    with open(f"{name}_work.csv", "w") as f:
        for line in lines:
            f.write(','.join(item for item in line))
            f.write('\n')


if __name__ == "__main__":
    # LOAD DATA
    with open("work.csv", 'r') as f:
        lines = []
        for line in f.readlines():
            lines.append(line.strip('\n').split(','))

    # GROUP DATA BY NAME INTO A DICTIONARY
    names = collections.defaultdict(list)
    for email, name, job in lines[1:]:
        names[name].append((email, job))

    # WRITE A NEW .csv FILE FOR EACH NAME
    for name in names:
        new_lines = lines[:1]
        for email, job in names[name]:
            new_lines.append([email, name, job])  # keep the original column order
        write_csv(name, new_lines)
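For completeness, a pandas-based sketch of the same idea (assuming work.csv has the header row shown above); pandas' groupby does the bucketing for you:
import pandas as pd

df = pd.read_csv("work.csv")
for name, group in df.groupby("Name"):
    group.to_csv(f"{name}_work.csv", index=False)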

Pyspark mapping regex

I have a PySpark dataframe with a text column.
I want to map the values with regex expressions:
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-RH', 'RH'))
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-FI', 'FI'))
I also want to map specific values according to a dictionary; I did the following (mapper comes from create_map()):
df = df.withColumn("mapped_col", mapper.getItem(F.col("action")))
Finally, the values which have not been mapped by the dictionary or the regex expressions should be set to null. I do not know how to do this part in combination with the other two.
Is it possible to have something like a dictionary of regex expressions so I can combine the two 'functions'?
{".*-RH": "RH", ".*FI" : "FI"}
Original Output Example
+-----------------------------+
|message                      |
+-----------------------------+
|GDF2009                      |
|GDF2014                      |
|ADS-set                      |
|ADS-set                      |
|XSQXQXQSDZADAA5454546a45a4-FI|
|dadaccpjpifjpsjfefspolamml-FI|
|dqdazdaapijiejoajojp565656-RH|
|kijipiadoa                   |
+-----------------------------+
Expected Output Example
+-----------------------------+----------+
|message                      |status    |
+-----------------------------+----------+
|GDF2009                      |GDF       |
|GDF2014                      |GDF       |
|ADS-set                      |ADS       |
|ADS-set                      |ADS       |
|XSQXQXQSDZADAA5454546a45a4-FI|FI        |
|dadaccpjpifjpsjfefspolamml-FI|FI        |
|dqdazdaapijiejoajojp565656-RH|RH        |
|kijipiadoa                   |null or ??|
+-----------------------------+----------+
So the first four lines are mapped with the dict, and the others are mapped using the regexes. Unmapped values become null (or ??).
Thank you,
You can achieve it using the contains function:
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

df = spark.createDataFrame(
    ["GDF2009", "GDF2014", "ADS-set", "ADS-set", "XSQXQXQSDZADAA5454546a45a4-FI", "dadaccpjpifjpsjfefspolamml-FI",
     "dqdazdaapijiejoajojp565656-RH", "kijipiadoa"], StringType()).toDF("message")
df.show()

names = ("GDF", "ADS", "FI", "RH")

def c(col, names):
    return [f.when(f.col(col).contains(i), i).otherwise("") for i in names]

df.select("message", f.concat_ws("", f.array_remove(f.array(*c("message", names)), "")).alias("status")).show()
output:
+--------------------+
| message|
+--------------------+
| GDF2009|
| GDF2014|
| ADS-set|
| ADS-set|
|XSQXQXQSDZADAA545...|
|dadaccpjpifjpsjfe...|
|dqdazdaapijiejoaj...|
| kijipiadoa|
+--------------------+
+--------------------+------+
| message|status|
+--------------------+------+
| GDF2009| GDF|
| GDF2014| GDF|
| ADS-set| ADS|
| ADS-set| ADS|
|XSQXQXQSDZADAA545...| FI|
|dadaccpjpifjpsjfe...| FI|
|dqdazdaapijiejoaj...| RH|
| kijipiadoa| |
+--------------------+------+
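To address the original "dictionary of regexes" idea more directly, here is a rough, untested sketch (the patterns below are assumptions based on the example data): build one f.when(f.col(...).rlike(pattern), label) condition per pattern and coalesce them, so anything that matches nothing falls through to null.
from pyspark.sql import functions as f

# Hypothetical mapping of regex pattern -> label; adjust to your real rules.
patterns = {r".*-RH$": "RH", r".*-FI$": "FI", r"^GDF.*": "GDF", r"^ADS.*": "ADS"}

# when() without otherwise() yields null when its condition is false,
# so coalesce() picks the first matching label and leaves unmatched rows as null.
conditions = [f.when(f.col("message").rlike(pat), f.lit(label))
              for pat, label in patterns.items()]
df = df.withColumn("status", f.coalesce(*conditions))
df.show()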

Pandas not displaying all columns when writing to CSV

I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However the actual CSV file i am getting is like so:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
Using this code (runnable by calling createCSV(); it pulls data from the COVID government GitHub):
import csv          # csv reader
import pandas as pd # csv parser
import collections  # not needed
import requests     # retrieves URL from gov data


def getFile():
    url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
    response = requests.get(url)
    print('Writing file...')
    open('us_deaths.csv', 'wb').write(response.content)


# takes raw data from link. creates a CSV for each unique state and removes unneeded headings
def createCSV():
    getFile()
    # init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    # drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    # data.drop(['Admin2'], axis=1, inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    # data.drop(['Province_State'], axis=1, inplace=True)
    data.to_csv('DEBUGDATA2.csv')

    # sets Province_State as primary key. Searches based on date and key to create new CSVs in the root directory of the python app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to set the date columns (everything after the first two) to a date, so that I can select only from 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know.
To ensure it works, it prints out all the field names, including Admin2 (county name), Province_State, and the rest of the dates.
However, in my CSV as you can see, Admin2 seems to have disappeared. I am not sure how to make this work, if anyone has any ideas that'd be great!
I changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
I needed to create a multi-level index to allow the Admin2 column to show. Any smoother tips on the date-range section are still welcome.
Thanks for the help all!
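On the date-range part, one possible cleanup is sketched below (untested; it assumes the remaining column labels are the date strings from the CSV): parse the column labels once and build a boolean mask, instead of renaming the columns and indexing with pd.date_range.
import pandas as pd

# Parse the date-like column labels once; non-date labels become NaT and are excluded by the mask.
date_cols = pd.to_datetime(data.columns, errors="coerce")
mask = (date_cols >= "2020-03-23") & (date_cols <= "2020-03-29")
wanted = data.columns[mask]

for name, g in data.groupby(level="Province_State"):
    g[wanted].to_csv(f"{name}_confirmed_deaths.csv")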

How to split a csv file row to columns in python?

Sorry, I am new to Python. I have a CSV file that gets Google Trends data written to it. However, the output is all written to the same column. I want the date in column A, Bitcoin in column B, Cryptocurrency in column C, and so on. I am really struggling with this simple task. Can anyone help, please? Thanks.
Below is the sample of the csv file.
"date Bitcoin Cryptocurrency Crypto isPartial"
"2013-10-27 5 0 0 False"
"2013-11-03 5 0 0 False"
"2013-11-10 5 0 0 False"
"2013-11-17 12 0 0 False"
"2013-11-24 14 0 0 False"
"2013-12-01 13 0 0 False"
This is my code to generate the file
#login
pytrend = TrendReq(google_username,google_password)
pytrend = TrendReq()
#Payload
pytrend.build_payload(kw_list=['Bitcoin','Cryptocurrency','Crypto'])
#interest over time
interest_over_time_df = pytrend.interest_over_time()
df = pd.DataFrame(interest_over_time_df)
file_name = "/Users/username/Desktop/Bitcoin.csv"
df.to_csv(file_name, sep='\t')
Here you go. You will need pandas to load it into a dataframe.
import pandas as pd

dataframe = pd.read_csv('Bitcoin.csv', delimiter=r"\s+")
dataframe
First of all, take a look at the csv module documentation for Python; it should give you all the info and examples you need.
I understand you want to write your rows as CSV separated by tabs, so something like this should work for you:
import csv

# open a file to write to (the filename here is just an example)
with open('output.csv', 'w', newline='') as csvfile:
    # First you create a csv.writer
    spamwriter = csv.writer(csvfile, delimiter='\t')
    # You write a row as a list into the csv.writer
    spamwriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])
I was able to find a few ideas from other posts. Option 1 simply uses string formatting to make the output look nice, while Option 2 uses PrettyTable to give a nicely formatted answer. You can find the PrettyTable documentation here.
Option 1 comes from this previous post. All you would have to do is play around with the numbers so that the spacing looks good enough to make you happy, and of course change the file name to match your CSV file.
Option 1
You could use format to left justify your output. For example,
import csv

f = open("contactlist.csv")
csv_f = csv.reader(f)

for row in csv_f:
    print('{:<15} {:<15} {:<20} {:<25}'.format(*row))
Output:
Name            Phone           Company              Email
Elon Musk       454-6723        SpaceX               emusk@spacex.com
Larry Page      853-0653        Google               lpage@gmail.com
Tim Cook        133-0419        Apple                tcook@apple.com
Steve Ballmer   456-7893        Developers!          sballmer@bluescreen.com
You can read more about format here. The < symbol left-aligns the text, and the number specifies the width of the string. Each {} can include a positional argument before the colon (:); if they are omitted, the strings appear in the order of the arguments in the unpacked list row.
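For example, a quick illustration of the same format specs on made-up values:
# Width 10, left-aligned, with and without explicit positional indices.
print('{:<10}|{:<6}|'.format('Name', 'Age'))    # Name      |Age   |
print('{1:<10}|{0:<6}|'.format(30, 'Alice'))    # Alice     |30    |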
Option 2
For Option 2, I was able to find this information here: Python Pretty Table.
That page gives you a multitude of ways to solve this problem, including a very simple one using the from_csv() function, which can be imported from PrettyTable with from prettytable import from_csv. Look at the example below for better insight.
Example:
Data.csv
"City name", "Area", "Population", "Annual Rainfall"
"Adelaide", 1295, 1158259, 600.5
"Brisbane", 5905, 1857594, 1146.4
"Darwin", 112, 120900, 1714.7
"Hobart", 1357, 205556, 619.5
"Sydney", 2058, 4336374, 1214.8
"Melbourne", 1566, 3806092, 646.9
"Perth", 5386, 1554769, 869.4
Python Code:
#!/usr/bin/python3
from prettytable import from_csv

with open("data.csv", "r") as fp:
    x = from_csv(fp)

print(x)
Output will look something like the following:
+-----------+------+------------+-----------------+
| City name | Area | Population | Annual Rainfall |
+-----------+------+------------+-----------------+
| Adelaide | 1295 | 1158259 | 600.5 |
| Brisbane | 5905 | 1857594 | 1146.4 |
| Darwin | 112 | 120900 | 1714.7 |
| Hobart | 1357 | 205556 | 619.5 |
| Sydney | 2058 | 4336374 | 1214.8 |
| Melbourne | 1566 | 3806092 | 646.9 |
| Perth | 5386 | 1554769 | 869.4 |
+-----------+------+------------+-----------------+
Please let me know if this was beneficial by leaving a comment or casting a vote, thank you!
