I'm trying to parse through a csv file and extract the data from only specific columns.
Example csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
I'm trying to capture only specific columns, say ID, Name, Zip and Phone.
Code I've looked at has led me to believe I can reference a specific column by its corresponding number, i.e. Name would correspond to 2, and iterating through each row using row[2] would produce all the items in column 2. Only it doesn't.
Here's what I've done so far:
import sys, argparse, csv
from settings import *

# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',\
    fromfile_prefix_chars="#")
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file

# open csv file
with open(csv_file, 'rb') as csvfile:
    # get number of columns
    for line in csvfile.readlines():
        array = line.split(',')
        first_item = array[0]
    num_columns = len(array)
    csvfile.seek(0)

    reader = csv.reader(csvfile, delimiter=' ')
    included_cols = [1, 2, 6, 7]

    for row in reader:
        content = list(row[i] for i in included_cols)
        print content
I'm expecting this to print out only the specific columns I want for each row, except it doesn't; I get the last column only.
The only way you would be getting the last column from this code is if you don't include your print statement in your for loop.
This is most likely the end of your code:
for row in reader:
    content = list(row[i] for i in included_cols)
print content
You want it to be this:
for row in reader:
    content = list(row[i] for i in included_cols)
    print content
Now that we have covered your mistake, I would like to take this time to introduce you to the pandas module.
Pandas is spectacular for dealing with csv files, and the following code would be all you need to read a csv and save an entire column into a variable:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']
So if you wanted to save all of the info in your column Names into a variable, this is all you need to do:
names = df.Names
It's a great module and I suggest you look into it. If for some reason your print statement was in the for loop and it was still only printing out the last column, which shouldn't happen, let me know; my assumption may be wrong. Your posted code has a lot of indentation errors, so it was hard to know what was supposed to be where. Hope this was helpful!
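As a minimal sketch of that approach applied to the question's columns, assuming the file is comma-delimited with a header row matching the sample table (the filename college_data.csv is hypothetical):

import pandas as pd

# usecols keeps only the requested columns at parse time
df = pd.read_csv('college_data.csv', usecols=['ID', 'Name', 'Zip', 'Phone'])
print(df['Name'])  # one column as a Series
print(df.head())   # the first rows of just those four columns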
import csv
from collections import defaultdict

columns = defaultdict(list)  # each value in each column is appended to a list

with open('file.txt') as f:
    reader = csv.DictReader(f)  # read rows into a dictionary format
    for row in reader:  # read a row as {column1: value1, column2: value2, ...}
        for (k, v) in row.items():  # go over each column name and value
            columns[k].append(v)  # append the value into the appropriate list
                                  # based on column name k

print(columns['name'])
print(columns['phone'])
print(columns['street'])
With a file like
name,phone,street
Bob,0893,32 Silly
James,000,400 McHilly
Smithers,4442,23 Looped St.
Will output
>>>
['Bob', 'James', 'Smithers']
['0893', '000', '4442']
['32 Silly', '400 McHilly', '23 Looped St.']
Or alternatively if you want numerical indexing for the columns:
with open('file.txt') as f:
    reader = csv.reader(f)
    next(reader)
    for row in reader:
        for (i, v) in enumerate(row):
            columns[i].append(v)
print(columns[0])
>>>
['Bob', 'James', 'Smithers']
To change the delimiter, add delimiter=" " to the appropriate instantiation, i.e. reader = csv.reader(f, delimiter=" ")
Use pandas:
import pandas as pd
my_csv = pd.read_csv(filename)
column = my_csv.column_name
# you can also use my_csv['column_name']
Discard unneeded columns at parse time:
my_filtered_csv = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])
P.S. I'm just aggregating what others have said in a simple manner. The actual answers are taken from here and here.
You can use numpy.loadtxt(filename). For example, if this is your database .csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | Adam | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Carl | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Adolf | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Den | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
And you want the Name column:
import numpy as np
b = np.loadtxt(r'filepath\name.csv', dtype=str, delimiter='|', skiprows=1, usecols=(1,))
>>> b
array([' Adam ', ' Carl ', ' Adolf ', ' Den '],
dtype='|S7')
More easily, you can use genfromtxt:
b = np.genfromtxt(r'filepath\name.csv', delimiter='|', names=True, dtype=None)
>>> b['Name']
array([' Adam ', ' Carl ', ' Adolf ', ' Den '],
dtype='|S7')
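Note the padding spaces around the names in both outputs; if you want them stripped at parse time, genfromtxt accepts autostrip=True (a small sketch under the same assumptions as above):

b = np.genfromtxt(r'filepath\name.csv', delimiter='|', names=True, dtype=None, autostrip=True)
# b['Name'] now yields 'Adam', 'Carl', ... without the surrounding spaces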
With pandas you can use read_csv with usecols parameter:
df = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])
Example:
import pandas as pd
import io
s = '''
total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
'''
df = pd.read_csv(io.StringIO(s), usecols=['total_bill', 'day', 'size'])
print(df)
total_bill day size
0 16.99 Sun 2
1 10.34 Sun 3
2 21.01 Sun 3
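usecols also accepts a callable evaluated against each column name, which is handy when the columns follow a rule rather than a fixed list; a small sketch reusing the string s defined above:

import io
import pandas as pd

# keep a column only when the callable returns True for its name
df = pd.read_csv(io.StringIO(s), usecols=lambda name: name in {'total_bill', 'day', 'size'})
print(df)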
Context: For this type of work you should use the amazing python petl library. It will save you a lot of work and potential frustration compared to doing things 'manually' with the standard csv module. AFAIK, the only people who still use the csv module are those who have not yet discovered better tools for working with tabular data (pandas, petl, etc.), which is fine, but if you plan to work with a lot of data from various strange sources in your career, learning something like petl is one of the best investments you can make. Getting started should only take 30 minutes after you've done pip install petl. The documentation is excellent.
Answer: Let's say you have the first table in a csv file (you can also load directly from the database using petl). Then you would simply load it and do the following.
from petl import fromcsv, look, cut, tocsv

# Load the table
table1 = fromcsv('table1.csv')
# Keep only the columns we want
table2 = cut(table1, 'Song_Name', 'Artist_ID')
# Have a quick look to make sure things are OK. Prints a nicely formatted table to your console
print look(table2)
# Save to new file
tocsv(table2, 'new.csv')
I think there is an easier way
import pandas as pd
dataset = pd.read_csv('table1.csv')
ftCol = dataset.iloc[:, 0].values
So here in iloc[:, 0], the : means all rows and 0 is the position of the column.
In the example below, ID will be selected:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
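If you want several columns by position rather than one, iloc also accepts a list of column indices. A small sketch, assuming the same kind of file (the positions follow the sample table above, where ID is 0, Name is 1, Zip is 5 and Phone is 6):

import pandas as pd

dataset = pd.read_csv('table1.csv')  # filename carried over from the snippet above
subset = dataset.iloc[:, [0, 1, 5, 6]]  # ID, Name, Zip, Phone by position
print(subset.head())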
import pandas as pd

csv_file = pd.read_csv("file.csv")
# _ndarray_values was a private attribute removed in newer pandas;
# to_numpy() is the public equivalent
column_val_list = csv_file.column_name.to_numpy()
Thanks to the way you can index and subset a pandas dataframe, a very easy way to extract a single column from a csv file into a variable is:
myVar = pd.read_csv('YourPath', sep=",")['ColumnName']
A few things to consider:
The snippet above will produce a pandas Series and not dataframe.
The suggestion from ayhan with usecols will also be faster if speed is an issue.
Testing the two different approaches using %timeit on a 2122 KB sized csv file yields 22.8 ms for the usecols approach and 53 ms for my suggested approach.
And don't forget import pandas as pd
If you need to process the columns separately, I like to destructure the columns with the zip(*iterable) pattern (effectively "unzip"). So for your example:
ids, names, zips, phones = zip(*(
(row[1], row[2], row[6], row[7])
for row in reader
))
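A self-contained sketch of the same pattern, assuming a comma-delimited file with a header row (the filename is hypothetical; the column indices are the ones from the snippet above):

import csv

with open('college_data.csv') as f:  # hypothetical filename
    reader = csv.reader(f)
    next(reader)  # skip the header row
    ids, names, zips, phones = zip(*(
        (row[1], row[2], row[6], row[7])  # same indices as above
        for row in reader
    ))

print(names)  # a tuple holding every value from the Name column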
import pandas as pd
dataset = pd.read_csv('Train.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
X is a bunch of columns; use it if you want to read more than one column.
y is a single column; use it to read one column.
[:, 1:-1] reads as [row_index : to_row_index, column_index : to_column_index], i.e. all rows, and the columns from position 1 up to (but not including) the last one.
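To make the slicing concrete, a tiny sketch with made-up data (the column names are hypothetical):

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'feat1': [0.1, 0.2], 'feat2': [0.3, 0.4], 'label': [0, 1]})
print(df.iloc[:, 1:-1])  # all rows; every column between the first and the last (feat1, feat2)
print(df.iloc[:, -1])    # all rows; just the last column (label)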
SAMPLE.CSV
a, 1, +
b, 2, -
c, 3, *
d, 4, /
import pandas as pd

column_names = ["Letter", "Number", "Symbol"]
df = pd.read_csv("sample.csv", names=column_names)
print(df)
OUTPUT
Letter Number Symbol
0 a 1 +
1 b 2 -
2 c 3 *
3 d 4 /
letters = df.Letter.to_list()
print(letters)
OUTPUT
['a', 'b', 'c', 'd']
import csv

with open('input.csv', encoding='utf-8-sig') as csv_file:
    # the below statement will skip the first row
    next(csv_file)
    reader = csv.DictReader(csv_file)
    Time_col = {'Time': []}
    # print(Time_col)
    for record in reader:
        Time_col['Time'].append(record['Time'])
    print(Time_col)
From CSV File Reading and Writing you can import csv and use this code:
with open('names.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_name'], row['last_name'])
To fetch the column names, instead of using readlines() it is better to use readline(), which avoids the loop and reading the complete file into an array:
with open(csv_file, 'rb') as csvfile:
    # get number of columns
    line = csvfile.readline()
    first_item = line.split(',')
    num_columns = len(first_item)
I'm getting (with a Python API) a .csv file from an email attachment that I received in Gmail, transforming it into a dataframe to do some data prep, and saving it as .csv on my PC. It is working great; the problem is that I get '\n' in some columns (it came like that from the source attachment).
The code that I used to get the data and transform it into a dataframe and .csv:
r = io.BytesIO(part.get_payload(decode = True))
df = pd.DataFrame(r)
df.to_csv('C:/Users/x.csv', index = False)
Example of the df that I get:
+-------------+----------+---------+----------------------+
| Information | Modified | Created | MD_x0020_Agenda\r\n' |
+-------------+----------+---------+----------------------+
| c | d | f | \r\n' |
| b\n' | | | |
| c | e | \r\n' | |
+-------------+----------+---------+----------------------+
Example of the output that would be correct:
+-------------+----------+---------+----------------------+
| Information | Modified | Created | MD_x0020_Agenda\r\n' |
+-------------+----------+---------+----------------------+
| c | d | f | \r\n' |
| b | c | e | \r\n' |
+-------------+----------+---------+----------------------+
I tried to use line_terminator; in my mind, if I forced it to use only \r\n and not \n, it would work. It didn't.
df.to_csv('C:/Users/x.csv', index=False, line_terminator='\r\n')
Can somebody help me with this? It's really freaking me out; because of it I can't advance my project. Thanks.
Usually this "\n" marks a line break within the data, i.e. a 'return' key was pressed in the source.
You can get rid of it by applying replace on your dataframe:
df = df.replace('\n', '', regex=True)
(Without regex=True, replace only matches cells whose entire value is '\n'; the regex form removes embedded newlines.)
For more details on the function, consider checking the specific Pandas documentation.
Hope it works.
I mixed the two answers and got the solution, thanks!
PS: with some research I found that this is a Windows/Excel issue: when you export .csv, it considers \n and \r\n (\r too?) as a new row. A DataFrame considers only \r\n as a new row (by default).
df = pd.read_csv(io.BytesIO(part.get_payload(decode=True)), header=None)
# grab the first row for the header
new_header = df.iloc[0]
# take the data less the header row
df = df[1:]
# set the header row as the df header
df.columns = new_header
# replace the \n which is creating new lines
df['Information'] = df['Information'].replace(regex='\n', value='')
df.to_csv('C:/Users/x.csv', index=False)
I am new to Python and would really appreciate the assistance; I tried the entire day. I have a csv file containing 10 columns, and I am only interested in 3 of them: state, county and zipcode. I am trying to get a count of the occurrences in each column, for instance CA 20000, TX 14000, and to have the count result saved in a csv file that could be easily imported into Excel and further merged with geospatial files.
I managed to select the 3 columns that I need:
import numpy as np
from tabulate import tabulate
import pandas as pd

# Replace with the path and name of the file on your computer
filename = "10column.csv"
# Enter the column numbers that you want to display between [ , , ], no space between the commas, e.g. "usecols=[3,4,5]"
table = np.genfromtxt(filename, delimiter=',', skip_header=0, dtype='U', usecols=[4, 5, 6])
print(tabulate(table))
# Insert the path and name of the output file
pd.DataFrame(table).to_csv("3column.csv")
Then I tried to count the occurrences, but the output is in the wrong format and I cannot save it as csv.
import csv
from collections import Counter
import numpy as np

my_reader = csv.reader(open("3column.csv"))
# insert the column number you want instead of the 2 in "rec[2]"
column = [rec[2] for rec in my_reader]
print(np.array([Counter(column)]))
The result is
[Counter({'22209': 10, '20007': 5, …'})]
I cannot save it as csv, and I would like to have it in a tabulated format:
zip, count
22209, 10
20007, 5
I would really appreciate your help.
A different way to approach this would be using value_counts() from Pandas. From the Pandas documentation:
Return a Series containing counts of unique values.
Example data file 7column.csv:
id,state,city,zip,ip_address,latitude,longitude
1,NY,New York City,10005,246.78.179.157,40.6964,-74.0253
2,WA,Yakima,98907,79.61.127.155,46.6288,-120.574
3,OK,Oklahoma City,73109,129.226.225.133,35.4259,-97.5261
4,FL,Orlando,32859,104.196.5.159,28.4429,-81.4026
5,NY,New York City,10004,246.78.180.157,40.6964,-74.0253
6,FL,Orlando,32860,104.196.5.159,29.4429,-81.4026
7,IL,Chicago,60641,19.226.187.13,41.9453,-87.7474
8,NC,Fayetteville,28314,45.109.1.38,35.0583,-79.008
9,IL,Chicago,60642,19.226.187.14,41.9453,-87.7474
10,WA,Yakima,98907,79.61.127.156,46.6288,-120.574
11,IL,Chicago,60643,19.226.187.15,41.9453,-87.7474
12,CA,Sacramento,94237,77.208.31.167,38.3774,-121.4444
import pandas as pd
df = pd.read_csv("7column.csv")
zipcode = df["zip"].value_counts()
state = df["state"].value_counts()
city = df["city"].value_counts()
zipcode.to_csv('zipcode_count.csv')
state.to_csv('state_count.csv')
city.to_csv('city_count.csv')
CSV output files
state_count.csv | city_count.csv | zipcode_count.csv
,state | ,city | ,zip
IL,3 | Chicago,3 | 98907,2
NY,2 | Orlando,2 | 32859,1
FL,2 | New York City,2 | 94237,1
WA,2 | Yakima,2 | 32860,1
NC,1 | Sacramento,1 | 28314,1
OK,1 | Fayetteville,1 | 10005,1
CA,1 | Oklahoma City,1 | 10004,1
| | 60643,1
| | 60642,1
| | 60641,1
| | 73109,1
You could read the file you wrote out to CSV back in as a DataFrame and use the count method that Pandas has.
states_3 = pd.DataFrame(table)
state_count = states_3.count(axis='columns')
out_name = 'statecount.xlsx'
with pd.ExcelWriter(out_name) as writer:
    state_count.to_excel(writer, sheet_name='counts')
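Alternatively, if you want to stay with the Counter approach from the question and just get the requested zip, count layout into a file, the standard csv module is enough. A minimal sketch, assuming the 3column.csv produced above (to_csv also writes a header row, which the sketch skips):

import csv
from collections import Counter

with open("3column.csv") as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row written by to_csv
    counts = Counter(rec[2] for rec in reader)  # column index 2, as in the question

with open("zip_counts.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["zip", "count"])
    for value, count in counts.most_common():
        writer.writerow([value, count])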
I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However, the actual CSV file I am getting is like so:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
Using this code (runnable by calling createCSV(); it pulls data from the government COVID GitHub):
import csv  # csv reader
import pandas as pd  # csv parser
import collections  # not needed
import requests  # retrieves URL from gov data

def getFile():
    url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
    response = requests.get(url)
    print('Writing file...')
    open('us_deaths.csv', 'wb').write(response.content)

# takes raw data from link. creates a CSV for each unique state and removes unneeded headings
def createCSV():
    getFile()
    # init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    # drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    # data.drop(['Admin2'], axis=1, inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    # data.drop(['Province_State'], axis=1, inplace=True)
    data.to_csv('DEBUGDATA2.csv')
    # sets Province_State as the primary key. Searches based on date and key to create new CSVs in the root directory of the python app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to set the date columns (everything after the first two) to a date, so that I can select only from 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know.
To ensure it works, it prints out all the field names, including Admin2 (county name), Province_State, and the rest of the dates.
However, in my CSV as you can see, Admin2 seems to have disappeared. I am not sure how to make this work, if anyone has any ideas that'd be great!
I changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
I needed to create a multi-key index to allow the Admin2 column to show. Any smoother tips on the date-range section are welcome.
Thanks for the help all!
I'm new to Python and would like to pass the ZipCode in an Excel file to the 'uszipcode' package and write the state for that particular zipcode to the 'OriginalState' column in the Excel sheet. The reason for doing this is that I want to compare the existing states with the original states. I don't understand if the for loop in the code is wrong or something else is. Currently, I cannot write the states to the OriginalState column in Excel. The code I've written is:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import uszipcode as US
from uszipcode import ZipcodeSearchEngine

search = ZipcodeSearchEngine()
df = pd.read_excel("H:\excel\checking for zip and states\checkZipStates.xlsx", sheet_name='Sheet1')
# print(df.values)

for i, row in df.iterrows():
    zipcode = search.by_zipcode(row['ZipCode'])  # for searching zipcode
    b = zipcode.State
    df.at['row', 'OriginalState'] = b

df.to_excel("H:\\excel\\checking for zip and states\\new.xlsx", sheet_name="compare", index=False)
The excel sheet is in this format:
| ZipCode |CurrentState | OriginalState |
|-----------|-----------------|---------------|
| 59714 | Montana | |
| 29620 | South Carolina | |
| 54405 | Wisconsin | |
| . | . | |
| . | . | |
You can add the OriginalState column without iterating the df:
Define a function that returns the value you want for any given zip code:
def get_original_state(state):
    zipcode = search.by_zipcode(state)  # for searching zipcode
    return zipcode.State
Then:
df['OriginalState'] = df.apply(lambda row: get_original_state(row['ZipCode']), axis=1)
Finally, export the df to Excel only once.
This should do the trick.
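Putting it together, a minimal sketch of the whole flow under the question's assumptions (the paths, sheet names, and the ZipcodeSearchEngine API are carried over from the snippets above):

import pandas as pd
from uszipcode import ZipcodeSearchEngine

search = ZipcodeSearchEngine()

def get_original_state(zip_code):
    # look up the zipcode and return its state, as in the question's snippet
    return search.by_zipcode(zip_code).State

df = pd.read_excel("H:\\excel\\checking for zip and states\\checkZipStates.xlsx", sheet_name='Sheet1')
df['OriginalState'] = df.apply(lambda row: get_original_state(row['ZipCode']), axis=1)
df.to_excel("H:\\excel\\checking for zip and states\\new.xlsx", sheet_name="compare", index=False)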