I know this is a very easy task, but I am acting pretty dumb right now and can't get it solved. I need to copy the first column of a .csv file, including the header, into a newly created file. My code:
station = 'SD_01'
import csv
import pandas as pd
df = pd.read_csv(str( station ) + "_ED.csv", delimiter =';')
list1 = []
matrix1 = df[df.columns[0]].as_matrix()
list1 = matrix1.tolist()
with open('{0}_RRS.csv'.format(station),"r+") as f:
    writer = csv.writer(f)
    writer.writerows(map(lambda x: [x], list1))
As a result, my file has an empty line between the values, has no header (I could continue without the header, though) and has something at the bottom which I cannot identify.
>350
>
>351
>
>352
>
>...
>
>949
>
>950
>
>Ž‘’“”•–—˜™š›œžŸ ¡¢
Just a short impression of the 1200+ lines
I am pretty sure that this is a very clunky way to do this; easier ways are always welcome.
How do I get rid of all the empty lines and this crazy stuff at the end?
When you get a column from a dataframe, it's returned as type Series, and the Series has a built-in to_csv method you can use. So you don't need to do any matrix casting or anything like that.
import pandas as pd
df = pd.read_csv('name.csv',delimiter=';')
first_column = df[df.columns[0]]  # a Series holding the first column
first_column.to_csv('new_file.csv', index=False, header=True)  # keep the header, drop the index
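As for the blank lines and the trailing garbage in the original attempt, both most likely come from how the output file was opened. Mode "r+" does not truncate an existing file, so bytes left over from an earlier, longer version survive after the new data, and on Windows the csv module produces doubled line endings unless the file is opened with newline=''. Here is a minimal sketch using only the csv module, assuming the ';' delimiter from the question. The function name copy_first_column is made up for illustration.

```python
import csv

def copy_first_column(src_path, dst_path, delimiter=';'):
    """Copy the first column of src_path, header included, into dst_path."""
    # "w" truncates any existing file; newline='' prevents doubled line endings.
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst)
        for row in reader:
            if row:  # skip completely empty lines
                writer.writerow([row[0]])
```

With the file names from the question this would be called as copy_first_column('SD_01_ED.csv', 'SD_01_RRS.csv').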
I am trying to analyze a large dataset from Yelp. The data is in JSON format, but the file is too large, so the script crashes when it tries to read all the data at once. So I decided to read it line by line and concatenate the lines into a dataframe to get a proper sample of the data.
f = open('./yelp_academic_dataset_review.json', encoding='utf-8')
I tried without encoding utf-8 but it creates an error.
I created a function that reads the file line by line and builds a pandas dataframe up to a given number of lines.
However, some lines are lists, so the script iterates over each list too and adds its items to the dataframe.
def json_parser(file, max_chunk):
    f = open(file)
    df = pd.DataFrame([])
    for i in range(2, max_chunk + 2):
        try:
            type(f.readlines(i)) == list
            for j in range(len(f.readlines(i))):
                part = json.loads(f.readlines(i)[j])
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
        except:
            f = open(file, encoding = "utf-8")
            for j in range(len(f.readlines(i))):
                try:
                    part = json.loads(f.readlines(i)[j-1])
                except:
                    print(i,j)
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
    df2.reset_index(inplace=True, drop=True)
    return df2
But I am still getting a "list index out of range" error. (Yes, I used print to debug.)
So I looked more closely at the lines that cause this error.
But, very interestingly, when I try to look at those lines, the script gives me a different list each time.
Here is what I mean:
I ran the cells repeatedly and got lists of different lengths.
So I looked at the lists:
They seem to be completely different lists. Each run returns a different list although the line number is the same, and the readlines documentation is not helping. What am I missing?
Thanks in advance.
You are using the expression f.readlines(i) several times as if it was referring to the same set of lines each time.
But as a side effect of evaluating the expression, more lines are actually read from the file. At some point you are basing the indices j on more lines than are actually available, because they came from a different invocation of f.readlines.
You should use f.readlines(i) only once in each iteration of the for i in ... loop and store its result in a variable instead.
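Note also that the argument to readlines is a size hint, not a line number, so f.readlines(i) never means "line i". Here is a minimal sketch of reading each line exactly once, assuming the file holds one JSON object per line. The function name read_json_lines is hypothetical.

```python
import json
from itertools import islice

def read_json_lines(path, max_lines):
    """Parse up to max_lines JSON records, assuming one record per line."""
    records = []
    with open(path, encoding='utf-8') as f:
        # Iterating over the file object yields every line exactly once.
        for line in islice(f, max_lines):
            line = line.strip()
            if line:  # ignore blank lines
                records.append(json.loads(line))
    return records
```

The resulting list of dicts can then be handed to pd.DataFrame(records) in a single step instead of concatenating one row at a time.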
I am trying to read a csv file with pandas. The file has 14993 lines after the header.
data = pd.read_csv(filename, usecols=['tweet', 'Sentiment'])
print(len(data))
It prints 14900, and if I add one line to the end of the file it becomes 14901 rows, so it is not because of a memory limit or similar. I also tried "error_bad_lines", but nothing changed.
From the names of your headers, one can suspect that you have free text. That can easily trip up any csv parser.
In any case here's a version that easily allows you to track down inconsistencies in the csv, or at least gives a hint of what to look for… and then puts it into a dataframe.
import csv
import pandas as pd
with open('file.csv', newline='') as fc:
    creader = csv.reader(fc) # add settings as needed
    rows = [r for r in creader]
# check consistency of rows
print(len(rows))
print(set((len(r) for r in rows)))
bogus_nbr = 3 # placeholder: set this to an unexpected row length reported by the set above
print(tuple(((i, r) for i, r in enumerate(rows) if len(r) == bogus_nbr)))
# find bogus lines and modify in memory, or change csv and re-read it.
# assuming there are headers...
columns = list(zip(*rows))
df = pd.DataFrame({k: v for k, *v in columns if k in ['tweet', 'Sentiment']})
If the dataset is really big, the code should be rewritten to use only generators (which is not that hard to do).
The one thing not to forget when using a technique like this is that every value arrives as a string, so numeric columns should be cast to a suitable datatype if needed. That becomes self-evident as soon as one attempts to do math on a dataframe filled with strings.
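A generator-based variant of the consistency check could look like the sketch below. It never holds all rows in memory, at the cost of reading the file twice. It assumes that the most common row length is the correct one, which may not hold for every file. The function name row_length_report is hypothetical.

```python
import csv
from collections import Counter

def row_length_report(path, **reader_kwargs):
    """Return (Counter of row lengths, {row_index: row} for rows of unusual length)."""
    with open(path, newline='') as fc:
        lengths = Counter(len(r) for r in csv.reader(fc, **reader_kwargs))
    if not lengths:  # empty file
        return lengths, {}
    # Assumption: the most frequent length is the intended column count.
    expected = lengths.most_common(1)[0][0]
    with open(path, newline='') as fc:
        odd = {i: r for i, r in enumerate(csv.reader(fc, **reader_kwargs))
               if len(r) != expected}
    return lengths, odd
```

The second return value points directly at the rows worth inspecting by hand.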
I have a csv file in which I need to change the date value in each row. The date to be changed appears in the exact same column in each row of the csv.
import csv
firstfile = open('example.csv',"r")
firstReader = csv.reader(firstfile, delimiter='|')
firstData = list(firstReader)
DateToChange = firstData[1][25]
ChangedDate = '2018-09-30'
for row in firstReader:
    for column in row:
        print(column)
        if column==DateToChange:
            #Change the date
            outputFile = open("output.csv","w")
            outputFile.writelines(firstfile)
            outputFile.close()
I am trying to grab and store a date already in the csv and change it using a for loop, then output the original file with the changed dates. However, the code above doesn't seem to do anything at all. I am newer to Python so I might not be understanding how to use a for loop correctly.
Any help at all is greatly appreciated!
When you call list(firstReader), you read all of the CSV data in to the firstData list. When you then, later, call for row in firstReader:, the firstReader is already exhausted, so nothing will be looped. Instead, try changing it to for row in firstData:.
Also, when you write to the file, you are writing firstfile itself rather than the altered rows. I'll leave you to figure out how to update the date in the row, but after that you'll need to give the file a string to write. To keep the original '|' delimiter, that string should be '|'.join(row) plus a newline, so outputFile.write('|'.join(row) + '\n').
Finally, you should open your output file once, not each time in the loop. Move the open call to above your loop, and the close call to after your loop. Then when you have a moment, search google for 'python context manager open file' for a better way to manage the open file.
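Putting these points together, here is a minimal sketch that lets csv.writer do the joining instead of building the string by hand. It assumes the '|' delimiter from the question. The function name replace_in_column and its parameters are made up for illustration.

```python
import csv

def replace_in_column(src_path, dst_path, column, old, new, delimiter='|'):
    """Copy src_path to dst_path, replacing `old` with `new` in one column."""
    # Both files are opened once, and the context manager closes them.
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter)
        for row in reader:
            if len(row) > column and row[column] == old:
                row[column] = new
            writer.writerow(row)  # writes the (possibly altered) row
```

For the question's file this might be called as replace_in_column('example.csv', 'output.csv', 25, DateToChange, ChangedDate).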
You could use pandas and numpy. Here I create a dataframe from scratch, but you could load it directly from a .csv:
import pandas as pd
import numpy as np
date_df = pd.DataFrame(
    {'col1' : ['12', '14', '14', '3412', '2'],
     'col2' : ['2018-09-30', '2018-09-14', '2018-09-01', '2018-09-30', '2018-12-01']
    })
date_to_change = '2018-09-30'
replacement_date = '2018-10-01'
date_df['col2'] = np.where(date_df['col2'] == date_to_change, replacement_date, date_df['col2'])
I'm new to Python and I'm facing a small challenge that I haven't been able to figure out so far.
I receive a csv file with around 30-40 columns and 5-50 rows with various details in each cell. The 1st row of the csv has the title for each column, and the item values start from the 2nd row.
What I want to do is create a Python script that reads the csv file and does the following every time:
Add a row after the actual 1st item row (literally after the 2nd row, because the 1st row is titles). The new 3rd row should contain the same information as the row above with one difference only: in the column "item_subtotal" I want to add the value from the column "discount_total".
All the rows below should remain as they are, and the modified csv should be saved as a new file with the word "edited" added to the file name.
I could really use some help, because so far I've only managed to open the csv file with the Python script I'm developing. I have not been able to copy the contents of the row above into the newly created row and replace that specific value.
Looking forward to any help.
Thank you
Here I'm attaching the CSV with some values changed for privacy reasons.
order_id,order_number,date,status,shipping_total,shipping_tax_total,fee_total,fee_tax_total,tax_total,discount_total,order_total,refunded_total,order_currency,payment_method,shipping_method,customer_id,billing_first_name,billing_last_name,billing_company,billing_email,billing_phone,billing_address_1,billing_address_2,billing_postcode,billing_city,billing_state,billing_country,shipping_first_name,shipping_last_name,shipping_address_1,shipping_address_2,shipping_postcode,shipping_city,shipping_state,shipping_country,shipping_company,customer_note,item_id,item_product_id,item_name,item_sku,item_quantity,item_subtotal,item_subtotal_tax,item_total,item_total_tax,item_refunded,item_refunded_qty,item_meta,shipping_items,fee_items,tax_items,coupon_items,order_notes,download_permissions_granted,admin_custom_order_field:customer_type_5
15001_TEST_2,,"2017-10-09 18:53:12",processing,0,0.00,0.00,0.00,5.36,7.06,33.60,0.00,EUR,PayoneCw_PayPal,"0,00",0,name,surname,,name.surname#gmail.com,0123456789,"address 1",,41541_TEST,location,,DE,name,surname,address,01245212,14521,location,,DE,,,1328,302,"product title",103,1,35.29,6.71,28.24,5.36,0.00,0,,"id:1329|method_id:free_shipping:3|method_title:0,00|total:0.00",,id:1330|rate_id:1|code:DE-MWST-1|title:MwSt|total:5.36|compound:,"id:1331|code:#getgreengent|amount:7.06|description:Launchcoupon for friends","text string",1,
You can also use pandas to manipulate the data from the csv like this:
import os
import pandas
Read the csv file into a pandas dataframe:
df = pandas.read_csv(filename)
Copy the first row of data (label 0, since the header row is not counted) and add the discount total to the item subtotal:
new_row = df.loc[[0]].copy()
new_row['item_subtotal'] += new_row['discount_total']
Concatenate the first data row, the new row, and then everything after that:
df = pandas.concat([df.loc[:0], new_row, df.loc[1:]], ignore_index=True)
Change the filename and write out the new csv file:
root, ext = os.path.splitext(filename)
filename = root + 'edited' + ext
df.to_csv(filename, index=False)
I hope this helps! Pandas is great for cleanly handling massive amounts of data, but may be overkill for what you are trying to do. Then again, maybe not. It would help to see an example data file.
The first step is to turn that .csv into something that is a little easier to work with. Fortunately, python has the 'csv' module which makes it easy to turn your .csv file into a much nicer list of lists. The below will give you a way to both turn your .csv into a list of lists and turn the modified data back into a .csv file.
import csv
import copy
def csv2list(ifile):
    """
    ifile = the path of the csv to be converted into a list of lists
    """
    olist = []
    # newline='' is what the csv module expects in Python 3
    with open(ifile, newline='') as f:
        c = csv.reader(f, dialect='excel')
        for line in c:
            olist.append(line) #and update the outer array
    return olist
#------------------------------------------------------------------------------
def list2csv(ilist, ofile):
    """
    ilist = the list of lists to be converted
    ofile = the output path for your csv file
    """
    with open(ofile, 'w', newline='') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',',
                               quotechar='"', quoting=csv.QUOTE_MINIMAL)
        csvwriter.writerows(ilist)
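Now, you can simply copy ilist[1] and change the appropriate element to reflect your summed value. Here n is the index of the item_subtotal column and n-x the index of the discount_total column. The csv module returns every cell as a string, so convert to float before adding:
listTemp = copy.deepcopy(ilist[1])
listTemp[n] = str(float(listTemp[n]) + float(listTemp[n-x]))
ilist.insert(2,listTemp)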
As for how to change the file name, just use:
import os
newFileName = os.path.splitext(oldFileName)[0] + "edited" + os.path.splitext(oldFileName)[1]
Hopefully this will help you out!
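For reference, the pieces above can be combined into one end-to-end sketch. It looks the two columns up by name in the header row rather than by hard-coded index, and it assumes the values are plain decimal numbers that should be written back with two decimals. The function name duplicate_first_item_row is made up for illustration.

```python
import csv
import os

def duplicate_first_item_row(in_path):
    """Insert a copy of the first item row with discount_total added to item_subtotal."""
    with open(in_path, newline='') as f:
        rows = list(csv.reader(f))
    header, first = rows[0], rows[1]
    sub = header.index('item_subtotal')
    disc = header.index('discount_total')
    new_row = list(first)
    # Cells are strings, so convert before adding (two decimals is an assumption).
    new_row[sub] = '{:.2f}'.format(float(first[sub]) + float(first[disc]))
    rows.insert(2, new_row)  # right after the first item row
    root, ext = os.path.splitext(in_path)
    out_path = root + 'edited' + ext
    with open(out_path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)
    return out_path
```

The function returns the path of the new file, so the original csv is left untouched.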
I am very new to python programming. I am trying to take a csv file that has two columns of string values and want to compare the similarity ratio of the string between both columns. Then I want to take the values and output the ratio in another file.
The csv may look like this:
Column 1|Column 2
tomato|tomatoe
potato|potatao
apple|appel
I want the output file to show for each row, how similar the string in Column 1 is to Column 2. I am using difflib to output the ratio score.
This is the code I have so far:
import csv
import difflib
f = open('test.csv')
csf_f = csv.reader(f)
row_a = []
row_b = []
for row in csf_f:
    row_a.append(row[0])
    row_b.append(row[1])
a = row_a
b = row_b
def similar(a, b):
    return difflib.SequenceMatcher(a, b).ratio()
match_ratio = similar(a, b)
match_list = []
for row in match_ratio:
    match_list.append(row)
with open("output.csv", "wb") as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(match_list)
f.close()
I get the error:
Traceback (most recent call last):
  File "comparison.py", line 24, in <module>
    for row in match_ratio:
TypeError: 'float' object is not iterable
I feel like I am not importing the column list correctly and running it against the sequencematcher function.
Here is another way to get this done using pandas:
Consider your csv data is like this:
Column 1,Column 2
tomato,tomatoe
potato,potatao
apple,appel
CODE
import pandas as pd
import difflib as diff
#Read the CSV
df = pd.read_csv('datac.csv')
#Create a new column 'diff' and get the result of comparision to it
df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x.iloc[0].strip(), x.iloc[1].strip()).ratio(), axis=1)
#Save the dataframe to CSV and you could also save it in other formats like excel, html etc
df.to_csv('outdata.csv',index=False)
Result
Column 1,Column 2 ,diff
tomato,tomatoe ,0.923076923077
potato,potatao ,0.923076923077
apple,appel ,0.8
The for loop you're setting up here expects something like an array where you have match_ratio, and judging by the error you're getting, that's not what you have. It looks like you're missing the first argument for difflib.SequenceMatcher, which should probably be None. See 6.3.1 here: https://docs.python.org/3/library/difflib.html
Without that first argument, your first list is interpreted as the isjunk parameter and your second list as a, while b keeps its default empty string, so ratio() returns a single float (0.0 here). Even if you correct your SequenceMatcher call, you'll still be trying to iterate over the single float value that ratio returns. You need to call SequenceMatcher inside the loop for each pair of values you're comparing.
So you'd wind up with a call more like this in your function: difflib.SequenceMatcher(None, a, b). Or if you'd prefer, since these are named arguments, you could do something like this: difflib.SequenceMatcher(a=a, b=b).
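A minimal sketch of that idea, with one SequenceMatcher call per pair of strings, so the result is a list of floats that can be iterated over. The helper name similarity_ratios is hypothetical, and the sample pairs are taken from the question.

```python
import difflib

def similarity_ratios(pairs):
    """Return one similarity ratio per (a, b) pair of strings."""
    # None as the first argument means "no junk filter".
    return [difflib.SequenceMatcher(None, a, b).ratio() for a, b in pairs]

pairs = [("tomato", "tomatoe"), ("potato", "potatao"), ("apple", "appel")]
ratios = similarity_ratios(pairs)
for (a, b), ratio in zip(pairs, ratios):
    print(a, b, ratio)
```

To feed csv.writer.writerows, each ratio would still need to be wrapped in a list, for example [[r] for r in ratios].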
Your sample file looks like it contains markup tags. Assuming you are actually reading a CSV file, the error you are getting is because match_ratio is not an iterable datatype, it's a floating point number -- the return value of your function: similar(). In your code, the function call would have to be contained within a for loop to call it for each a, b string pair. Here's a working example I created that does away with the explicit for loops and uses a list comprehension instead:
import csv
from difflib import SequenceMatcher
path_in = 'csv1.csv'
path_out = 'csv2.csv'
with open(path_in, 'r', newline='') as csv_file_in:
    csv_reader = csv.reader(csv_file_in)
    col_headers = next(csv_reader)  # first row holds the column headers
    # one [a, b, ratio] entry per remaining row
    results = [[row[0],
                row[1],
                SequenceMatcher(None, row[0], row[1]).ratio()]
               for row in csv_reader]
with open(path_out, 'w', newline='') as csv_file_out:
    col_headers.append('Ratio')
    out_rows = [col_headers] + results
    writer = csv.writer(csv_file_out, delimiter=',')
    writer.writerows(out_rows)
In addition to the error you received you might also have run into a problem when instantiating the SequenceMatcher object -- its first parameter wasn't specified in your code. You can find more on list comprehensions and SequenceMatcher in the Python docs. Good luck in your future Python coding.
You are getting that error because the records row[0] or row[1] most probably contain NaN values.
Try forcing them to strings first by using str(row[0]) and str(row[1]).
You are getting the error because you are running SequenceMatcher on the lists of strings, rather than on the strings themselves. When you do this, you get back a single float value, rather than the list of ratio values I think you were expecting.
If I understand what you are trying to do, then you don't need to read in the rows first. You can simply find the diff ratio as you iterate through the rows.
import csv
import difflib
match_list = []
with open('test.csv', newline='') as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        # one single-element row per input row, so writerows accepts it
        match_list.append([difflib.SequenceMatcher(a=row[0], b=row[1]).ratio()])
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(match_list)