I have a CSV file hello.csv with CIK numbers in the third column. I also have a second file, cik.csv, which contains the related companies (in column 4) for those CIK numbers (in column 1), and I want a list of the related companies for the CIK numbers in hello.csv.
I tried it with a loop:
import csv

with open('hello.csv', 'r', encoding="latin-1") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    list1 = list(readCSV)

b = -1
for j in list1:
    b = b + 1
    if b > 0:
        cik = j[2]

with open('cik.csv', 'r', encoding="latin-1") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    list2 = list(readCSV)
I don't know how to find my CIK in cik.csv and get the related company. Can I use pandas for this?
Use pandas to read in the two .csv files and map the respective values:
import pandas as pd
from io import StringIO
## create some dummy data
hello_csv="""
a,b,cik_numbers,d
'test',1,12, 5
'var', 6, 2, 0.1
"""
cik_csv="""
cik_numbers,b,c,related_companies
12,1,12, 'Apple'
13,6,20, 'Microsoft'
2,1,712,'Google'
"""
## note: you would normally pass a path to your csv files instead,
# like: df_hello = pd.read_csv('/the/path/to/hello.csv')
df_hello = pd.read_csv(StringIO(hello_csv))
df_cik = pd.read_csv(StringIO(cik_csv))
## and add a new column to df_hello based on a mapping of cik_numbers
df_hello['related_companies'] = df_hello['cik_numbers'].map(df_cik.set_index('cik_numbers')['related_companies'])
print(df_hello)
yields:
a b cik_numbers d related_companies
0 'test' 1 12 5.0 'Apple'
1 'var' 6 2 0.1 'Google'
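For reference, the same lookup can also be written as a left join with pd.merge; this is just an equivalent sketch using the two frames defined above:
## equivalent to the .map above, but as an explicit left join
df_merged = df_hello.merge(
    df_cik[['cik_numbers', 'related_companies']],
    on='cik_numbers',
    how='left'  # keep every row of df_hello, matched or not
)
print(df_merged)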
I am a beginner looking for a solution. I am trying to compare columns from two CSV files that have no header. The first one has one column and the second one has two.
File_1.csv: #contains 2k rows with random numbers.
1
4
1005
.
.
.
9563
File_2.csv: #Contains 28k rows
0 [81,213,574,697,766,1074,...21622]
1 [0,1,4,10,12,13,1005, ...31042]
2 [35,103,85,1023,...]
3 [4,24,108,76,...]
4 []
.
.
.
28280 [0,1,9,10,32,49,56,...]
First, I want to compare the column of File_1 with the first column of File_2, extract the matching rows (including the second column of File_2) into a new CSV file (output.csv), and drop the rows that do not match. For example,
output.csv:
1 [0,1,4,10,12,13,1005, ...31042]
4 []
.
.
.
Second, I want to compare the File_1.csv column (iterating over its 2k rows) with the second column (each array) of output.csv, keep only the matching values, and delete the ones that do not match, saving the result back into output.csv while also keeping its first column. For example, 4 is deleted because its array is empty, so there is nothing to compare against File_1, but rows like 1 do have some values that match.
output.csv:
1 [1,4,1005]
.
.
.
I found code that works for the first step, but it does not save the second column. I have been looking into how to compare the arrays, but I haven't been able to figure it out.
This is what I have so far,
import csv

nodelist = []
node_matches = []

with open('File_1.csv', 'r') as f_rand_node:
    csv_f = csv.reader(f_rand_node)
    for row in csv_f:
        nodelist.append(row[0])

set_node = set(nodelist)

with open('File_2.csv', 'r') as f_tbl:
    with open('output.csv', 'w') as f_out:
        csv_f = csv.reader(f_tbl)
        for row in csv_f:
            set_row = set(' '.join(row).split(' '))
            if set_row.intersection(set_node):
                node_match = list(set_row.intersection(set_node))[0]
                f_out.write(node_match + '\n')
Thank you for the help.
I'd recommend using pandas for this case.
File_1.csv:
1
4
1005
9563
File_2.csv:
0 [81,213,574,697,766,1074]
1 [0,1,4,10,12,13,1005,31042]
2 [35,103,85,1023]
3 [4,24,108,76]
4 []
5 [0,1,9,10,32,49,56]
Code:
import pandas as pd
import csv
file1 = pd.read_csv('File_1.csv', header=None)
file1.columns=['number']
file2 = pd.read_csv('File_2.csv', header=None, sep=r'\s+', index_col=0)
file2.columns = ['data']
df = file2[file2.index.isin(file1['number'].tolist())] # first step: keep rows whose index matches File_1
df = df[df['data'] != '[]'] # second step: drop rows whose array is empty
df.to_csv('output.csv', header=None, sep='\t', quoting=csv.QUOTE_NONE)
Output.csv:
1 [0,1,4,10,12,13,1005,31042]
The entire thing is a lot easier with pandas DataFrames:
import pandas as pd
import ast

# Read the two files into DataFrames (neither file has a header row)
df1 = pd.read_csv("File_1.csv", names=["number"])
df2 = pd.read_csv("File_2.csv", sep=r"\s+", names=["id", "data"], index_col=0)

# First step: keep only the rows of File_2 whose first column appears in File_1
df2 = df2[df2.index.isin(df1["number"])]

# Second step: intersect each bracketed list with the numbers from File_1
numbers = set(df1["number"])
df2["data"] = df2["data"].map(lambda s: sorted(numbers & set(ast.literal_eval(s))))

# Drop rows whose intersection is empty and write the result
df2 = df2[df2["data"].map(len) > 0]
df2.to_csv("output.csv", header=False, sep="\t")
This should do the trick, quite simply.
So I have a file that looks like this:
name,number,email,job1,job2,job3,job4
I need to convert it to one that looks like this:
name,number,email,job1
name,number,email,job2
name,number,email,job3
name,number,email,job4
How would I do this in Python?
As said in a comment, you can use pandas to read, write, and manipulate CSV files.
Here is one example of how you can solve your problem with pandas in Python:
import pandas as pd
# df = pd.read_csv("filename.csv")  # read the csv file from disk
# and comment out the line below when reading from disk
df = pd.DataFrame([['ss','0152','ss#','student','others']], columns=['name','number','email','job1','job2'])
print(df)
The output of this line is:
name number email job1 job2
0 ss 0152 ss# student others
Now we need to know how many columns there are:
x = len(df.columns)
print(x)
It stores the number of columns in x:
5
Now let's create an empty DataFrame with columns = [name, number, email, job]:
c = pd.DataFrame(columns=['name','number','email','job'])
print(c)
output:
Columns: [name, number, email, job]
Index: []
Now we loop from 3 to the last column index and concat each sliced DataFrame with our empty one:
for i in range(3, x):
    df1 = df.iloc[:, 0:3].copy()  # take the first 3 columns (name, number, email)
    df2 = df.iloc[:, i].copy()    # take the ith column as a Series
    df1['job'] = df2              # add the ith column to df1 as 'job'
    c = pd.concat([df1, c])       # concat df1 and c
print(c)
output:
name number email job
0 ss 0152 ss# others
0 ss 0152 ss# student
DataFrame c has your desired output. Now you can save it using:
c.to_csv('output.csv')
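As an aside, this wide-to-long reshape is what pd.melt does in one call; a minimal sketch using the same df as above:
# melt job1..jobN into a single 'job' column, keeping the id columns
long_df = df.melt(id_vars=['name', 'number', 'email'], value_name='job')
long_df = long_df.drop(columns='variable')  # drop the job1/job2 label column
print(long_df)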
Let's assume this is the dataframe:
import pandas as pd
# DataFrame.append was removed in pandas 2.0, so build the row directly
df = pd.DataFrame([{'name': 'jon', 'number': 123, 'email': 'smth#smth.smth',
                    'job1': 'a', 'job2': 'b', 'job3': 'c', 'job4': 'd'}])
We define a new dataframe:
new_df = pd.DataFrame(columns=['name','number','email','job'])
Now, we loop over the old one to split it based on the jobs. I assume you have 4 jobs to split:
for i, row in df.iterrows():
    for job in range(1, 5):
        job_col = "job" + str(job)
        # build a one-row frame and concat it (append was removed in pandas 2.0)
        new_row = pd.DataFrame([{'name': row['name'], 'number': row['number'],
                                 'email': row['email'], 'job': row[job_col]}])
        new_df = pd.concat([new_df, new_row], ignore_index=True)
You can use the csv module and Python's unpacking syntax to get the data from the input file and write it to the output file.
import csv
with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    # Skip the header row, if necessary
    next(reader)
    # Use sequence unpacking to get the fixed fields and an
    # arbitrary number of "jobs".
    for name, number, email, *jobs in reader:
        for job in jobs:
            writer.writerow([name, number, email, job])
Below:
with open('input.csv') as f_in:
    lines = [l.strip() for l in f_in.readlines()]

with open('output.csv', 'w') as f_out:
    for idx, line in enumerate(lines):
        if idx > 0:  # skip the header row
            fields = line.split(',')
            for col in range(3, len(fields)):
                f_out.write(','.join(fields[:3]) + ',' + fields[col] + '\n')
input.csv
header row
name,number,email,job1,job2,job3,job4
name1,number1,email1,job11,job21,job31,job41
output.csv
name,number,email,job1
name,number,email,job2
name,number,email,job3
name,number,email,job4
name1,number1,email1,job11
name1,number1,email1,job21
name1,number1,email1,job31
name1,number1,email1,job41
How can I split a large CSV with many columns based on the changing value of one column, e.g. ID? Here is an example:
import pandas as pd
from io import StringIO
csvdata = StringIO("""ID,f1
1,3.2
1,4.3
1,10
7,9.1
7,2.3
7,4.4
""")
df = pd.read_csv(csvdata, sep=",")
df
My aim is to save each block in a separate csv whose name is generated in a loop based on the ID:
df_ID_1.csv
ID f1
1 3.2
1 4.3
1 10.0
df_ID_7.csv
ID f1
7 9.1
7 2.3
7 4.4
Thank you very much!
Just cycle through the IDs, create a sliced dataframe for each one, and create your .csv file:
for id_ in df['ID'].unique():
    temp_df = df.loc[df['ID'] == id_]
    file_name = "df_ID_{}.csv".format(id_)
    # build the path to where you want it saved
    file_path = "C:/Users/you/Desktop/" + file_name
    # write the single-ID dataframe to a csv
    temp_df.to_csv(file_path, index=False)
You can use the groupby method for this and access each separate group, then write it to a csv using DataFrame.to_csv:
for _, r in df.groupby('ID'):
    r.to_csv(f'df_ID_{r.ID.iloc[0]}.csv')
Or, if your Python version is < 3.6, use .format for string formatting instead of an f-string:
for _, r in df.groupby('ID'):
    r.to_csv('df_ID_{}.csv'.format(r.ID.iloc[0]))
This splits our dataframe into separate csv's.
Explanation of the loop we use:
for _, r in df.groupby('ID'):
    print(r, '\n')
    print(f'This is our ID {r.ID.iloc[0]}', '\n')
ID f1
0 1 3.2
1 1 4.3
2 1 10.0
This is our ID 1
ID f1
3 7 9.1
4 7 2.3
5 7 4.4
This is our ID 7
Without using Pandas: read the file using the csv module, sort by the specified column, groupby the specified column using the itertools module, iterate over the groups and write new files.
import csv
import itertools
import operator

key = operator.itemgetter('ID')

# assumes csvdata is a file-like object (io.StringIO in the OP's example)
reader = csv.DictReader(csvdata)
fields = reader.fieldnames
data = sorted(reader, key=key)

for key, group in itertools.groupby(data, key):
    with open(f'ID_{key}.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fields)
        writer.writeheader()
        writer.writerows(group)
I have a csv file with only a single column, which acts as my input. I use that input to find my outputs; I have multiple outputs, and I need them in another csv file. Can anyone suggest ways to do this?
Here is the code:
import urllib.request

jd = {input 1}  # the current input value

#
# Some code to find the outputs - a, b, c, d, e
#

# ** Code to write the output to a csv file.
# ** Repeat the code with the next input from the input csv file.
The input CSV file has only a single column, shown below:
1
2
3
4
5
The output would be in a separate csv, in the format given below; it would have multiple rows and multiple columns.
a b c d e
Here is a simple example:
The data.csv is a csv with one column and multiple rows.
The results.csv contains the mean and median of the input; it is a csv with 1 row and 2 columns (the mean in the 1st column and the median in the 2nd).
Example:
import numpy as np
import pandas as pd
import csv

# load the data
data = pd.read_csv("data.csv", header=None)

# calculate things for the 1st column, which holds the data
calculate_mean = np.mean(data.loc[:, 0])
calculate_median = np.median(data.loc[:, 0])
results = [calculate_mean, calculate_median]

# write the results to csv as one row: mean in column 1, median in column 2
with open("results.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(results)
In pseudo code, you'll do something like this:
for each_file in a_folder_that_contains_csv:  # go through all the `inputs` - csv files
    with open(each_file) as csv_file, open(other_file) as output_file:  # open each csv file, and a new csv file
        process_the_input_from_each_csv  # process the data you read from the csv_file
        export_to_output_file  # export the data to the new csv file
Now, I won't write a fully working example, because it's better for you to start digging and ask specific questions when you have some. Right now you're just asking: write this for me because I don't know Python.
here is the official documentation
here you can read about the csv module
here you can read about the os module
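To make the pseudo code a little more concrete, here is a bare skeleton of that loop, not a full solution; the folder name and the pass-through processing are placeholders you would replace with your own logic:
import csv
import os

input_folder = "inputs"  # hypothetical folder holding the input csv files

for file_name in os.listdir(input_folder):
    if not file_name.endswith(".csv"):
        continue
    in_path = os.path.join(input_folder, file_name)
    out_path = os.path.join(input_folder, "output_" + file_name)
    with open(in_path, newline="") as csv_file, open(out_path, "w", newline="") as output_file:
        reader = csv.reader(csv_file)
        writer = csv.writer(output_file)
        for row in reader:
            writer.writerow(row)  # placeholder: process the row here instead of copying it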
I think you need read_csv for reading the file into a Series, and to_csv for writing each output to a file while looping over the Series with Series.items:
#file content
1
3
5
import pandas as pd
import numpy as np

# squeeze=True was removed from read_csv, so call .squeeze('columns') instead
s = pd.read_csv('file', names=['a']).squeeze('columns')
print(s)
0 1
1 3
2 5
Name: a, dtype: int64
for i, val in s.items():  # iteritems was removed in pandas 2.0; items does the same
    # some operation with the scalar value val
    df = pd.DataFrame({'a': np.arange(val)})
    df['a'] = df['a'] * 10
    print(df)

    # write to csv, file name taken from val
    df.to_csv(str(val) + '.csv', index=False)
a
0 0
a
0 0
1 10
2 20
a
0 0
1 10
2 20
3 30
4 40
I am splitting a CSV file into separate files based on a column with dates. However, some rows contain a date while the other cells are empty. I want to remove those rows with empty cells from the CSV, but I'm not sure how to do this.
Here is my code:
import collections
import csv
import sys

csv.field_size_limit(sys.maxsize)

with open(main_file, "r") as fp:
    root = csv.reader(fp, delimiter='\t', quotechar='"')
    result = collections.defaultdict(list)
    next(root)
    for row in root:
        year = row[0].split("-")[0]
        result[year].append(row)

for i, j in result.items():
    row_count = sum(1 for row in j)
    print(row_count)
    file_path = "%s%s-%s.csv" % (src_path, i, row_count)
    with open(file_path, 'w') as fp:
        writer = csv.writer(fp, delimiter='\t', quotechar='"')
        writer.writerows(j)
Pandas is perfect for this, especially if you want the solution to be easily adjustable to, say, other file formats. Of course, one could consider it overkill.
To just remove rows with empty cells:
>>> import pandas as pd
>>> data = pd.read_csv('example.csv', sep='\t')
>>> print(data)
A B C
0 1 2 5
1 NaN 1 9
2 3 4 4
>>> data.dropna()
A B C
0 1 2 5
2 3 4 4
>>> data.dropna().to_csv('example_clean.csv')
I leave performing the splitting and saving into separate files using pandas as an exercise to start learning this great package if you want :)
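If you do want a starting point for that exercise, here is one possible sketch; it assumes the date column is named 'date' and holds values like 2017-01-31, so the year prefix drives the split:
import pandas as pd

data = pd.read_csv('example.csv', sep='\t').dropna()  # drop rows with empty cells

# group on the year prefix of a hypothetical 'date' column like "2017-01-31"
years = data['date'].str.split('-').str[0]
for year, group in data.groupby(years):
    group.to_csv(f'{year}.csv', sep='\t', index=False)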
This would skip all rows with at least one empty cell:
with open(main_file, "r") as fp:
    ....
    for row in root:
        if not all(map(len, row)):
            continue
Pandas is the best tool in Python for handling any type of data processing. For help, you can go through this link: http://pandas.pydata.org/pandas-docs/stable/10min.html