I have a csv file with a single column, which acts as my input.
I use that input to compute my outputs. There are multiple outputs per input, and I need those outputs written to another csv file.
Can anyone suggest ways to do this?
Here is an outline of the code:
import urllib.request
jd = {input 1}
# some code to find the outputs - a, b, c, d, e
# code to write the outputs to a csv file
# repeat with the next input from the input csv file
Input CSV File has only a single column and is represented below:
1
2
3
4
5
The output would go in a separate csv file, in the format below, with multiple rows and multiple columns:
a b c d e
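In other words, I am after something like the sketch below, where compute_outputs is just a placeholder name for whatever logic produces a, b, c, d and e:
import csv

def compute_outputs(value):
    # placeholder: replace with the real logic that derives a, b, c, d, e
    return [value, value * 2, value * 3, value * 4, value * 5]

with open("input.csv", newline="") as fin, open("output.csv", "w", newline="") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        writer.writerow(compute_outputs(int(row[0])))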
Here is a simple example:
The data.csv is a csv with one column and multiple rows.
The results.csv contains the mean and median of the input; it is a csv with 1 row and 2 columns (mean in the 1st column, median in the 2nd).
Example:
import numpy as np
import pandas as pd
import csv

# load the data
data = pd.read_csv("data.csv", header=None)

# calculate statistics for the 1st column, which holds the data
calculate_mean = np.mean(data.loc[:, 0])
calculate_median = np.median(data.loc[:, 0])
results = [calculate_mean, calculate_median]

# write the results as one row; the csv module needs text mode and newline="" in Python 3
with open("results.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(results)
In pseudo code, you'll do something like this:
for each_file in a_folder_that_contains_csv:  # go through all the `inputs` - csv files
    with open(each_file) as csv_file, open(other_file) as output_file:  # open each csv file, and a new csv file
        process_the_input_from_each_csv  # process the data you read from the csv_file
        export_to_output_file  # export the data to the new csv file
Now, I won't write a full working example, because it's better for you to start digging and ask specific questions once you have some. Right now you're just asking: write this for me because I don't know Python.
here is the official documentation
here you can read about the csv module
here you can read about the os module
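As a bare skeleton of that pseudocode (the folder name and the per-row processing are placeholders to fill in yourself):
import csv
import os

input_folder = "csv_inputs"  # placeholder folder name

for filename in os.listdir(input_folder):
    if not filename.endswith(".csv"):
        continue
    in_path = os.path.join(input_folder, filename)
    out_path = os.path.join(input_folder, "out_" + filename)
    with open(in_path, newline="") as csv_file, \
         open(out_path, "w", newline="") as output_file:
        reader = csv.reader(csv_file)
        writer = csv.writer(output_file)
        for row in reader:
            writer.writerow(row)  # placeholder: process row before writing it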
I think you need read_csv for reading the file into a Series and to_csv for writing each output to a file, looping over the Series with Series.iteritems.
#file content
1
3
5
import numpy as np
import pandas as pd

# note: newer pandas versions replace squeeze=True with .squeeze("columns")
s = pd.read_csv('file', squeeze=True, names=['a'])
print (s)
0 1
1 3
2 5
Name: a, dtype: int64
for i, val in s.iteritems():  # in newer pandas, use s.items()
    #print (val)
    #some operation with scalar value val
    df = pd.DataFrame({'a':np.arange(val)})
    df['a'] = df['a'] * 10
    print (df)
    #write to csv, file name by val
    df.to_csv(str(val) + '.csv', index=False)
a
0 0
a
0 0
1 10
2 20
a
0 0
1 10
2 20
3 30
4 40
I write into a csv file with this function:
def write_csv(hlavicka: Tuple[str, ...], zaznam: list, pomocne_csv: str) -> None:
    if not os.path.isfile(pomocne_csv):
        with open(pomocne_csv, "w", encoding=cfg.ENCODING, newline="") as soubor:
            writer = csv.writer(soubor, delimiter=cfg.DELIMITER)
            writer.writerow(hlavicka)
    with open(pomocne_csv, "a", encoding=cfg.ENCODING, newline="") as soubor:
        writer = csv.writer(soubor, delimiter=cfg.DELIMITER)
        writer.writerows([zaznam])
However, when I open the csv in MS Office, I see that long numbers are in scientific notation. For example, 102043292003060000 is displayed as 1.02E+17, even though I put 102043292003060000 into my write_csv() function.
The problem is that when I read the csv using:
def generuj_zaznamy(input_path):
    with open(input_path, "r", encoding="cp1250") as file_object:
        reader = csv.reader(file_object, delimiter=";")
        for entry in enumerate(reader, start=1):
            print(entry)
I got 1.02E+17 instead of 102043292003060000.
Is there a way to format the cell as a number directly in csv.writer or csv.reader? Thanks a lot.
If you open the csv file with a text editor like notepad.exe, you will see the long numbers accurately. So the problem comes from Excel, not from csv.writer.
If you want to see the long numbers accurately from the csv file, create a new xlsx file and use Data -> Get External Data -> From Text to import the csv file, then choose Text as the data format of the column.
Edited:
I tried the code, and it seems the problem also happens with pandas.DataFrame.to_csv(), not only csv.writer(), once the number reaches 20 digits or more, which is out of the range of np.int64.
I read the official documentation, and it seems the float_format argument can't solve this problem.
The solution I can give for now is below, provided you can read the original data as strings when the numbers are 20 digits or longer:
import numpy as np
import pandas as pd
import csv
df = pd.DataFrame(["3100000035155588379531799826432", "3100000035155588433002733375488", "3100000035155588355694446120960"])
df = "\t" + df  # prepend a tab so Excel treats the value as text
print(df)
df.to_csv("test.csv", index=False, header=False)
rng = np.random.default_rng(0)
big_nums = rng.random(10) * (10**19)  # OverflowError once this reaches 10**20
df = pd.DataFrame(big_nums, dtype=np.int64).astype(str)
# df = "\t" + df
print(df)
df.to_csv("test.csv", index=False, header=False)
and the output will look like this:
0
0 \t3100000035155588379531799826432
1 \t3100000035155588433002733375488
2 \t3100000035155588355694446120960
0
0 6369616873214542848
1 2697867137638703104
2 409735239361946880
3 165276355285290944
4 8132702392002723840
5 9127555772777217024
6 6066357757671798784
7 7294965609839983616
8 5436249914654228480
9 -9223372036854775808
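As a side note, if you read the file back in Python rather than Excel, keeping the values as strings preserves every digit. A small sketch, assuming the test.csv written above:
import pandas as pd

# dtype=str stops pandas from parsing the numbers as floats
df = pd.read_csv("test.csv", header=None, dtype=str)
print(df)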
I am a beginner and looking for a solution. I am trying to compare columns from two CSV files with no header. The first one has one column and the second one has two.
File_1.csv: #contains 2k rows with random numbers.
1
4
1005
.
.
.
9563
File_2.csv: #Contains 28k rows
0 [81,213,574,697,766,1074,...21622]
1 [0,1,4,10,12,13,1005, ...31042]
2 [35,103,85,1023,...]
3 [4,24,108,76,...]
4 []
.
.
.
28280 [0,1,9,10,32,49,56,...]
I want to first compare the column of File_1 with the first column of File_2 to find out if they match, and extract the matching values plus the second column of File_2 into a new CSV file (output.csv), dropping the values that do not match. For example,
output.csv:
1 [0,1,4,10,12,13,1005, ...31042]
4 []
.
.
.
Second, I want to compare the File_1.csv column (iterating over its 2k rows) with the second column (each array) of output.csv, keep the matching values and delete the ones that do not match, saving those matching values into output.csv while keeping the first column of that file. For example, 4 was deleted because its second column (array) was empty, so there were no numbers to compare with File_1, but others like 1 did have some that match:
output.csv:
1 [1,4,1005]
.
.
.
I found code that works for the first step, but it does not save the second column. I have been looking at how to compare arrays, but I haven't managed it yet.
This is what I have so far,
import csv
nodelist = []
node_matches = []
with open('File_1.csv', 'r') as f_rand_node:
    csv_f = csv.reader(f_rand_node)
    for row in csv_f:
        nodelist.append(row[0])
set_node = set(nodelist)

with open('File_2.csv', 'r') as f_tbl:
    with open('output.csv', 'w') as f_out:
        csv_f = csv.reader(f_tbl)
        for row in csv_f:
            set_row = set(' '.join(row).split(' '))
            if set_row.intersection(set_node):
                node_match = list(set_row.intersection(set_node))[0]
                f_out.write(node_match + '\n')
Thank you for the help.
I'd recommend using pandas for this case.
File_1.csv:
1
4
1005
9563
File_2.csv:
0 [81,213,574,697,766,1074]
1 [0,1,4,10,12,13,1005,31042]
2 [35,103,85,1023]
3 [4,24,108,76]
4 []
5 [0,1,9,10,32,49,56]
Code:
import pandas as pd
import csv
file1 = pd.read_csv('File_1.csv', header=None)
file1.columns=['number']
file2 = pd.read_csv('File_2.csv', header=None, delim_whitespace=True, index_col=0)
file2.columns = ['data']
df = file2[file2.index.isin(file1['number'].tolist())] # first step
df = df[df['data'] != '[]'] # second step
df.to_csv('output.csv', header=None, sep='\t', quoting=csv.QUOTE_NONE)
Output.csv:
1 [0,1,4,10,12,13,1005,31042]
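If you also need the second step from the question (keeping only the values inside each list that also appear in File_1), here is a possible sketch building on the frames above; it uses ast.literal_eval to turn the stored strings back into lists:
from ast import literal_eval

numbers = set(file1['number'])
df['data'] = df['data'].apply(lambda s: sorted(numbers & set(literal_eval(s))))
df = df[df['data'].astype(bool)]  # drop rows whose intersection came out empty
df.to_csv('output.csv', header=False, sep='\t', quoting=csv.QUOTE_NONE)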
The entire thing is a lot easier with pandas DataFrames:
import pandas as pd
from ast import literal_eval

# Read the files into two DataFrames (neither file has a header)
df1 = pd.read_csv("File_1.csv", header=None, names=["number"])
df2 = pd.read_csv("File_2.csv", header=None, sep=r"\s+",
                  names=["number", "data"], index_col="number")

# keep only the rows of File_2 whose index appears in File_1
df2 = df2[df2.index.isin(df1["number"])]

# intersect each list with the numbers from File_1, dropping empty results
numbers = set(df1["number"])
df2["data"] = df2["data"].apply(lambda s: sorted(numbers & set(literal_eval(s))))
df2 = df2[df2["data"].astype(bool)]

df2.to_csv("output.csv", sep="\t", header=False)
This should do the trick, quite simply.
I have a pandas data frame and now I want to add column names, but only for the second row. Here is an example of my previous output:
Desired output:
My code:
data_line = open("file1.txt", mode="r")
lines = []
for line in data_line:
    lines.append(line)
for i, line in enumerate(lines):
    # print('{}={}'.format(i+1, line.strip()))
    file1_header = lines[0]
num_line = 1
Dictionary_File1 = {}
Value_File1 = data_type[0:6]
Value_File1_short = []
i = 1
for element in Value_File1:
    type = element.split(',')
    Value_File1_short.append(type[0] + ", " + type[1] + ", " + type[4])
    i += 1
Dictionary_File1[file1_header] = Value_File1_short
pd_file1 = pd.DataFrame.from_dict(Dictionary_File1)
You should have a look at pandas.read_csv. The header keyword parameter allows you to indicate which line in the file to use for header names.
You could probably do it with something like:
pd.read_csv("file1.txt", header=1)
From my python shell I tested it out with:
>>> from io import StringIO # I use python3
>>> import pandas as pd
>>> data = """Type Type2 Type3
... A B C
... 1 2 3
... red blue green"""
>>> # StringIO below allows us to use "data" as input to read_csv
>>> # "sep" keyword is used to indicate how columns are separated in data
>>> df = pd.read_csv(StringIO(data), header=1, sep='\s+')
>>> df
A B C
0 1 2 3
1 red blue green
You can write a row using the csv module before writing your dataframe to the same file. Note that this won't help when reading back into pandas, which doesn't support "duplicate headers". You can create MultiIndex columns, but this isn't necessary for your desired output.
import pandas as pd
import csv
from io import StringIO
# input file
x = """A,B,C
1,2,3
red,blue,green"""
# replace StringIO(x) with 'file.txt'
df = pd.read_csv(StringIO(x))
with open('file.txt', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['Type', 'Type2', 'Type3'])
    df.to_csv(fout, index=False)
# read file to check output is correct
df = pd.read_csv('file.txt')
print(df)
# Type Type2 Type3
# 0 A B C
# 1 1 2 3
# 2 red blue green
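And if you do want both header rows back later, a small sketch of the MultiIndex route mentioned above:
# read both rows as a two-level column index
df = pd.read_csv('file.txt', header=[0, 1])
print(df.columns.tolist())
# [('Type', 'A'), ('Type2', 'B'), ('Type3', 'C')]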
So if I understand properly, you have a file "file.txt" containing your data, and a list containing the types of your data.
You want to add the list of types to the pandas.DataFrame of your data. Correct?
If so, you can read the data from the txt file into a pandas.DataFrame using pandas.read_csv(), and then define the column headers using df.columns.
So it would look something like:
df = pd.read_csv("file1.txt", header=None)
df.columns = data_type[0:6]
I hope this helps!
Cheers
I want to convert a data set in a .dat file into a csv file. The data format looks like this:
Each row begins with the sentiment score, followed by the text associated with that rating.
I want the sentiment value (-1 or 1) in one column and the corresponding review text in another column.
WHAT I TRIED SO FAR
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import csv
# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("train.dat").readlines()]
# write it as a new CSV file
with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)

def your_func(row):
    return row['Sentiments'] / row['Review']

columns_to_keep = ['Sentiments', 'Review']
dataframe = pd.read_csv("train.csv", usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)
print(dataframe)
Sample screenshot of the resulting train.csv: it has a comma after every word in the review.
If all your rows follow that consistent format, you can use pd.read_fwf. This is a little safer than using read_csv, in the event that your second column also contains the delimiter you are attempting to split on.
df = pd.read_fwf('data.txt', header=None,
                 widths=[2, int(1e5)], names=['label', 'text'])
print(df)
label text
0 -1 ieafxf rjzy xfxk ymi wuy
1 1 lqqm ceegjnbjpxnidygr
2 -1 zss awoj anxb rfw kgbvnl
data.txt
-1 ieafxf rjzy xfxk ymi wuy
+1 lqqm ceegjnbjpxnidygr
-1 zss awoj anxb rfw kgbvnl
As mentioned in the comments, read_csv would be appropriate here.
df = pd.read_csv('train_csv.csv', sep='\t', names=['Sentiments', 'Review'])
Sentiments Review
0 -1 alskjdf
1 1 asdfa
2 1 afsd
3 -1 sdf
I am splitting a CSV file into separate files based on a column with dates. However, some rows contain a date while their other cells are empty. I want to remove these rows with empty cells from the CSV, but I'm not sure how to do this.
Here's is my code:
import collections
import csv
import sys

csv.field_size_limit(sys.maxsize)

with open(main_file, "r") as fp:
    root = csv.reader(fp, delimiter='\t', quotechar='"')
    result = collections.defaultdict(list)
    next(root)
    for row in root:
        year = row[0].split("-")[0]
        result[year].append(row)

for i, j in result.items():
    row_count = sum(1 for row in j)
    print(row_count)
    file_path = "%s%s-%s.csv" % (src_path, i, row_count)
    with open(file_path, 'w') as fp:
        writer = csv.writer(fp, delimiter='\t', quotechar='"')
        writer.writerows(j)
Pandas is perfect for this, especially if you want this to be easily adjusted to, say, other file formats. Of course, one could consider it overkill.
To just remove rows with empty cells:
>>> import pandas as pd
>>> data = pd.read_csv('example.csv', sep='\t')
>>> print(data)
A B C
0 1 2 5
1 NaN 1 9
2 3 4 4
>>> data.dropna()
A B C
0 1 2 5
2 3 4 4
>>> data.dropna().to_csv('example_clean.csv')
I leave performing the splitting and saving into separate files using pandas as an exercise to start learning this great package if you want :)
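For instance, one possible sketch of that exercise (reusing main_file and src_path from the question, and assuming the date sits in the first column):
import pandas as pd

data = pd.read_csv(main_file, sep='\t').dropna()
# group on the year part of the first column and write one file per year
for year, chunk in data.groupby(data.iloc[:, 0].str.split('-').str[0]):
    chunk.to_csv('%s%s-%d.csv' % (src_path, year, len(chunk)), sep='\t', index=False)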
This would skip all rows with at least one empty cell:
with open(main_file, "r") as fp:
    ....
    for row in root:
        if not all(map(len, row)):
            continue
Pandas is the best tool in Python for handling any type of data processing. For help, you can go through the 10-minute guide: http://pandas.pydata.org/pandas-docs/stable/10min.html