I've got a lot of CSV files which contain strings. I would like to import the strings in Python 3 from the multiple CSVs into a master CSV, making sure that no duplicates which are already contained in the master CSV are added.
I've written some code, but I'm unsure how to get the printed output written to the master CSV and how to check for duplicates.
My current code is:
output = []
f = open('example.csv', 'r')
for line in f:
    cells = line.split(",")
    output.append(cells[3])
f.close()
print(output)
Any help would be appreciated.
Thanks in advance.
The answer really depends on how big those CSV files are, i.e. how many words you expect to end up with in the master CSV. Based on that you can have more or less optimized Python code.
First things first, you should provide some sort of example, since from what is shown, you take strings from the fourth column (index 3) and put them in the output list.
One solution could be this:
from csv import reader

words = set()
# open the master CSV file in case it already exists and load all words
# now, this is the part where you didn't give an example of how the master CSV should look
# I'll assume it's just a word-per-line text file
with open(MASTER_CSV_FILE, 'r') as f:
    for line in f:
        words.add(line.strip())
with open(NEW_CSV_FILE, 'r') as f:
    for columns in reader(f):
        words.add(columns[3])
# here again, I'll just write one word per line in MASTER_CSV_FILE
with open(MASTER_CSV_FILE, 'w') as f:
    for word in words:
        f.write(word + '\n')
I have based my answer on the following assumptions:
the master CSV file is actually a word-per-line text file (due to a lack of examples),
the new CSV file always has at least 4 comma-separated values in each row (since you read index 3),
you just want to dedupe words and do not want to count the number of duplicates.
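Since the question mentions several CSV files, here is a minimal sketch of the same idea extended to a whole directory. This is my own extension, assuming the new files sit in the current directory and the master file is skipped by name ('master.csv' is a placeholder):

from csv import reader
from glob import glob

MASTER_CSV_FILE = 'master.csv'  # placeholder name

words = set()
with open(MASTER_CSV_FILE, 'r') as f:
    for line in f:
        words.add(line.strip())

# collect the fourth column (index 3) from every other CSV in the directory
for path in glob('*.csv'):
    if path == MASTER_CSV_FILE:
        continue
    with open(path, 'r') as f:
        for columns in reader(f):
            words.add(columns[3])

with open(MASTER_CSV_FILE, 'w') as f:
    for word in words:
        f.write(word + '\n')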
Here's another way that may work for you.
import os
import pandas as pd

# Read the contents of the csv files into a list of DataFrames.
# I'm assuming all the csv's have the same data format.
frames = []
for f in os.listdir():
    if f.endswith(".csv"):
        frames.append(pd.read_csv(f))

# Combine everything into a single DataFrame. The duplicates will be
# removed once all the csv's have been loaded.
df = pd.concat(frames)
# Eliminate the duplicates. This will use the values in
# all the columns of the DataFrame to determine whether
# a particular row is a duplicate.
df.drop_duplicates(inplace=True)
You can then convert the DataFrame back to a csv file by using df.to_csv() if needed.
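For example, with 'master.csv' as a placeholder output name:
df.to_csv('master.csv', index=False)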
Hope that helps.
For a project I have devices that send payloads, and I should store them in a local file, but I have a memory limitation and I don't want to store more than 2000 data rows. Again, because of the memory limitation, I cannot have a database, so I chose to store the data in a CSV file.
I tried to use open('output.csv', 'r+') as f:; I'm appending the rows to the end of my CSV, and I have to check the length each time with sum(1 for line in f) to be sure it's not more than 2000.
The big problem starts when I reach 2000 rows and I ideally want to delete the first row and add another row to the end, or start writing rows from the beginning of the file and overwrite the old rows without deleting everything, but I don't know how to do it. I tried open('output.csv', 'w+') and open('output.csv', 'a+'), but w+ deletes all the contents when writing only one row, and a+ just continues to append to the end. On the other hand, I cannot count the number of rows anymore with either. Can you please tell me which command I should use to start rewriting each line from the beginning, or to delete one line from the beginning and append one to the end? I would also appreciate it if you could tell me whether there is a better choice than CSV files for storing this much data, or a better way to count the number of rows.
This should help. See the comments inline:
import pandas as pd

allowed_length = 2  # Set it to the required value (2000 in your case)
df = pd.read_csv('output.csv')  # Read your csv file into df
row_count = df.shape[0]  # Get row count
df.loc[row_count] = ['Fridge', 15]  # Insert row at end of df. In my case it has only 2 values
# If the count was already at or above allowed_length, delete the first row
if row_count >= allowed_length:
    df = df.drop(df.head(1).index)
df.to_csv('output.csv', index=False)
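The question also asks whether there is a better choice for the storage and the row counting. One lightweight alternative, which is my own sketch rather than part of the answer above, is a ring buffer built on collections.deque: a deque with maxlen=2000 silently drops the oldest row when a new one is appended.
import csv
from collections import deque

MAX_ROWS = 2000

# Load the existing rows; the deque keeps at most the last MAX_ROWS of them.
with open('output.csv', newline='') as f:
    rows = deque(csv.reader(f), maxlen=MAX_ROWS)

# Appending to a full deque drops the oldest row automatically.
rows.append(['Fridge', 15])

# Rewrite the whole file from the buffer.
with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)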
I would like to create a merged cell in an excel document with a word in it:
The function would ideally take as input arguments: start_row, start_col, finish_row, finish_col and word.
This function would edit the document such that I have a merged cell in the range given to the function.
An example usage:
In this example I would get the word tralala spanning from column C to column E in row 1.
The easiest way to handle Excel files is either by saving the sheet as CSV (which is an export option in Excel) or by using pandas:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
If you save the file as CSV and the built-in function in pandas does not help you (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) you can do:
with open(filepathhere, "r") as f:
    for line in f:
        cells = line.split(";")  # Find the correct separator here
        print(cells)  # Here you can find which index you want to retrieve to get your value.
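Note that CSV cannot represent merged cells, so if you need an actual merged cell in the .xlsx file itself, the function described in the question could be sketched with openpyxl instead. This is my own sketch, not part of the answer above; the file name is a placeholder:
from openpyxl import load_workbook

def merge_and_write(path, start_row, start_col, finish_row, finish_col, word):
    wb = load_workbook(path)
    ws = wb.active
    # Merge the range and put the word in its top-left cell.
    ws.merge_cells(start_row=start_row, start_column=start_col,
                   end_row=finish_row, end_column=finish_col)
    ws.cell(row=start_row, column=start_col, value=word)
    wb.save(path)

# Example: the word "tralala" spanning C1:E1 (columns 3 to 5, row 1).
merge_and_write('example.xlsx', 1, 3, 1, 5, 'tralala')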
I have a question. If I run this code, I get exactly what I have in the CSV. My target is to get a text/data/CSV file which would look like this:
['red', 'green', 'blue']
meaning:
converting a single column into a row;
while converting to a row, entering a comma to differentiate the values.
Is it possible to do this with Python? Are there any online materials I can look into?
The csv file is read sequentially, that is, you will always get the data back one row at a time. However, you can build an array of just the values you want as you read the file and discard the rest of the data.
import csv

with open('example.csv') as csvfile:
    colours = []
    for row in csv.reader(csvfile, delimiter=','):
        if len(row) <= 3:
            continue
        colours.append(row[3])
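If you then want that list written out as a single comma-separated row (the ['red', 'green', 'blue'] shape from the question), here is one small sketch, with 'colours.csv' as a placeholder output name:
with open('colours.csv', 'w', newline='') as out:
    csv.writer(out).writerow(colours)  # one row: red,green,blue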
You can achieve this using pandas. Try this code:
import pandas as pd

df = pd.read_csv("input csv file")
df = df.T  # transpose: each column becomes a row
df.to_csv("output csv file")
Let me know if this is what you are looking for.
Without using any extra libraries, you could write a function that takes a specific column, and returns that column as an array.
Such a function might look like this.
def column_to_row(col_num, rows):
    # rows: the csv data already loaded as a list of rows
    return [row[col_num] for row in rows]
Then you can extract whatever column you want, or iterate through the whole csv like this.
new_csv = []
for i in range(len(rows[0])):
    new_csv.append(column_to_row(i, rows))
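For example, with the data already loaded as a list of rows (the sample values are purely illustrative):
rows = [['red', 1], ['green', 2], ['blue', 3]]
print(column_to_row(0, rows))  # ['red', 'green', 'blue']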
I'm trying to create a dictionary file for a big CSV file that is divided into chunks to be processed, but when I'm creating the dictionary it's just doing it for one chunk, and when I try to append it, it passes an empty DataFrame to the new df. This is the code I used:
wdata = pd.read_csv(fileinput, nrows=0,).columns[0]
skip = int(wdata.count(' ') == 0)
dic = pd.DataFrame()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq'))
    dic.append(dic_tmp)
dic.to_csv('newwww.csv', index=False)
If I save dic_tmp, it is just a dictionary for one chunk, not the whole set, and dic takes a lot of time to process but returns empty DataFrames at the end. Is there any error in my code?
(The input CSV, the current output CSV, and the expected output were shown as images.)
So it's not adding the chunks together; it's just pasting the new chunk regardless of what is in the previous chunk or the CSV.
In order to split the column into words and count the occurrences:
df['sentences'].apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0)
or
from collections import Counter
result = Counter(" ".join(df['sentences'].values.tolist()).split(" ")).items()
both seem to be equally slow, but probably better than your approach.
Taken from here:
Count distinct words from a Pandas Data Frame
A couple of problems that I see:
Why read the CSV file twice? The first time here, wdata = pd.read_csv(fileinput, nrows=0,).columns[0], and the second time in the for loop.
Also, dic.append(dic_tmp) does not modify dic in place: DataFrame.append returns a new DataFrame, which is why dic stays empty. You would need dic = dic.append(dic_tmp).
If you aren't using the combined DataFrame further, I think it is better to write the chunks to the CSV file in append mode, as shown below:
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq'))
    dic_tmp.to_csv('newwww.csv', mode='a', header=False)
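Note that appending per-chunk counts will not combine the frequencies of a word that appears in several chunks. If you want one total count per word, a minimal sketch (my own, assuming fileinput and skip are defined as in the question) is to accumulate a Counter across the chunks and write it once at the end:
from collections import Counter

import pandas as pd

counts = Counter()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000):
    for sentence in chunk['sentences'].dropna():
        counts.update(sentence.split())

# most_common() yields (word, count) pairs sorted by frequency
pd.DataFrame(counts.most_common(), columns=['word', 'freq']).to_csv('newwww.csv', index=False)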
I have a CSV file with text data separated by commas in some columns, but not in others, e.g.:
https://i.imgur.com/X6bq09I.png
I want to export each row of my CSV file to a new CSV file. An example desired output for the first row of my original file would look like this:
https://i.imgur.com/QB9sLeL.png
I have tried the code offered in the first answer of this post: Open CSV file and writing each row to new, dynamically named CSV file.
This is the code I used:
import csv

counter = 1
with open('mock_data.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if row:
            filename = "trial%s" % str(counter)
            with open(filename, 'w', newline='') as csvfile_out:
                writer = csv.writer(csvfile_out)
                writer.writerow(row)
            counter = counter + 1
This code does produce a new .csv file for each row. However...
EDIT: I have three remaining issues, for which I have not found the right code:
1. I want each word to have its own cell in each row; I don't know how to do this when certain cells contain multiple words separated by commas, while other cells contain only a single word.
2. Once each word has its own cell, I want to transpose each row into a single column in the new .csv file.
3. I want to remove duplicate values from the column.
If you actually want a file extension, then use filename = "trial%s.csv" % str(counter)
But CSV files don't care about file extensions. Any file reader or code should be able to read the file.
TextEdit is just the Mac default for that.
Regarding "I need a single column with one word in each cell, in each new output file": when you do writer.writerow(row), make sure the row really has a single field, i.e. check if len(row) == 1 rather than if row.
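Putting the three remaining issues together (split multi-word cells, turn the row into a single column, drop duplicates), here is a minimal sketch of one way to do it. This is my own sketch, assuming mock_data.csv as in the question and that multi-word cells are quoted so the reader keeps them intact:
import csv

counter = 1
with open('mock_data.csv', newline='') as csvfile:
    for row in csv.reader(csvfile):
        if len(row) == 0:
            continue
        # Split every cell on commas, flatten the row into words,
        # and drop duplicates while preserving first-seen order.
        words = []
        for cell in row:
            for word in cell.split(','):
                word = word.strip()
                if word and word not in words:
                    words.append(word)
        # One word per line, i.e. a single column in the output file.
        with open('trial%s.csv' % counter, 'w', newline='') as out:
            writer = csv.writer(out)
            for word in words:
                writer.writerow([word])
        counter += 1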