Mapping CSV Header using a Dictionary - python

I have a reference file that looks like this:
Experiment,Array,Drug
8983,Genechip,Famotidine
8878,Microarray,Dicyclomine
8988,Genechip,Etidronate
8981,Microarray,Flunarizine
I successfully created a dictionary mapping the Experiment numbers to the Drug name using the following:
import csv

reader = csv.reader(open('C:\Users\Troy\Documents\ExPSRef.txt'))
# Configure dictionary
result = {}
for row in reader:
    key = row[0]
    result[key] = row[2]
di = result
I want to map this dictionary onto the header of another file, which consists of the experiment numbers. It currently looks like this:
Gene,8988,8981,8878,8983
Vcp,0.011,-0.018,-0.032,-0.034
Ube2d2,0.034,0.225,-0.402,0.418
Becn1,0.145,-0.108,-0.421,-0.048
Lypla2,-0.146,-0.026,-0.101,-0.011
But it should look like this:
Gene,Etidronate,Flunarizine,Dicyclomine,Famotidine
Vcp,0.011,-0.018,-0.032,-0.034
Ube2d2,0.034,0.225,-0.402,0.418
Becn1,0.145,-0.108,-0.421,-0.048
Lypla2,-0.146,-0.026,-0.101,-0.011
I tried using:
import csv
import pandas as pd

reader = csv.reader(open('C:\Users\Troy\Documents\ExPSRef.txt'))
result = {}
for row in reader:
    key = row[0]
    result[key] = row[2]
di = result
df = pd.read_csv('C:\Users\Troy\Documents\ExPS2.txt')
df['row[0]'].replace(di, inplace=True)
but it returned a KeyError: 'row[0]'.
I tried the following as well, even transposing in order to merge:
import pandas as pd

df1 = pd.read_csv('C:\Users\Troy\Documents\ExPS2.txt').transpose()
df2 = pd.read_csv('C:\Users\Troy\Documents\ExPSRef.txt', delimiter=',', engine='python')
df3 = df1.merge(df2)
df4 = df3.set_index('Drug').drop(['Experiment', 'Array'], axis=1)
df4.index.name = 'Drug'
print df4
and this time received MergeError('No common columns to perform merge on').
Is there a simpler way to map my dictionary to the header that would work?

One thing to keep in mind is that the keys of the mapper dictionary and the header values they are mapped onto must be of the same data type.
Here, one is a string and the other an integer. So, at read time, we stop pandas from inferring the dtype by setting it to str for the reference DF.
df1 = pd.read_csv('C:\Users\Troy\Documents\ExPS2.txt') # Original
df2 = pd.read_csv('C:\Users\Troy\Documents\ExPSRef.txt', dtype=str) # Reference
Convert the columns of the original DF to their Series representation, then replace the old values, which were experiment numbers, with the new drug names retrieved from the reference DF.
df1.columns = df1.columns.to_series().replace(df2.set_index('Experiment').Drug)
df1

I used csv for the whole script. This fixes the header as you wanted and saves the result into a new file; the new filename can be replaced with the original one if that's what you prefer. This program is written for Python 3.
import csv

with open('sample.txt', 'r') as ref:
    reader = csv.reader(ref)
    # skip header line
    next(reader)
    # make dictionary
    di = dict([(row[0], row[2]) for row in reader])

data = []
with open('sample1.txt', 'r') as df:
    reader = csv.reader(df)
    header = next(reader)
    new_header = [header[0]] + [di[i] for i in header if i in di]
    data = list(reader)

# used to make a new file; can also replace with the same file name
with open('new_sample1.txt', 'w') as df_new:
    writer = csv.writer(df_new)
    writer.writerow(new_header)
    writer.writerows(data)

Related

Python: use the arguments in enumerate as variable names for the nested loop

I would like to generate two data frames (and subsequently export them to CSV) from two CSV files. I came up with the following (incomplete) code, which focuses on dealing with a.csv. I create an empty data frame (df_a) to store rows from the iterrows iteration (df_b is missing).
The problem is that I do not know how to process b.csv without manually declaring all the empty data frames in advance (i.e. df_a = pd.DataFrame(columns=['start', 'end']) and df_b = pd.DataFrame(columns=['start', 'end'])).
I hope I can use the arguments of enumerate (i.e. the contents of file) as variable names (i.e. something like df_file) for the data frames (instead of df_a and df_b).
list_files = ["a.csv", "b.csv"]
for i, file in enumerate(list_files):
    df = pd.read_csv(file)
    # Create empty data frame to store data for each iteration below
    df_a = pd.DataFrame(columns=['start', 'end'])
    for index, row in df.iterrows():
        var = df.loc[index, 'name']
        df_new = SomeFunction(var)
        # Append a new row to the empty data frame
        dicts = {'start': df_new['column1'], 'end': df_new['column2']}
        df_dicts = pd.DataFrame([dicts])
        df_a = pd.concat([df_a, df_dicts], ignore_index=True)
    df_a_csv = df_a.to_csv('df_a.csv')
Ideally, it could look a bit like this (note: file is used as part of the variable name df_file):
list_files = ["a.csv", "b.csv"]
for i, file in enumerate(list_files):
    df = pd.read_csv(file)
    # Create empty data frame to store data for each iteration below
    df_file = pd.DataFrame(columns=['start', 'end'])
    for index, row in df.iterrows():
        var = df.loc[index, 'name']
        df_new = SomeFunction(var)
        # Append a new row to the empty data frame
        dicts = {'start': df_new['column1'], 'end': df_new['column2']}
        df_dicts = pd.DataFrame([dicts])
        df_file = pd.concat([df_file, df_dicts], ignore_index=True)
    df_file_csv = df_file.to_csv('df_' + file + '.csv')
Different approaches are also welcome. I just need to save the dataframe outcome for each input file. Many Thanks!
SomeFunction(var) aside, can you get the result you seek without pandas for the most part?
import csv

## -----------
## mocked
## -----------
def SomeFunction(var):
    return None
## -----------

list_files = ["a.csv", "b.csv"]
for file_path in list_files:
    with open(file_path, "r") as file_in:
        results = []
        for row in csv.DictReader(file_in):
            df_new = SomeFunction(row['name'])
            start, end = df_new['column1'], df_new['column2']
            results.append({"start": start, "end": end})
    with open(f"df_{file_path}", "w") as file_out:
        writer = csv.DictWriter(file_out, fieldnames=list(results[0].keys()))
        writer.writeheader()
        writer.writerows(results)
Note that you can also stream rows from the input to the output if you would rather not read them all into memory.
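For example, a minimal streaming sketch (SomeFunction, the 'name' column, and the a.csv filename are assumptions carried over from the question):
import csv

# Each row goes straight from the input to the output, so the whole
# file is never held in memory at once. SomeFunction and the 'name'
# column are assumed from the question.
with open("a.csv", "r", newline="") as file_in, \
     open("df_a.csv", "w", newline="") as file_out:
    writer = csv.DictWriter(file_out, fieldnames=["start", "end"])
    writer.writeheader()
    for row in csv.DictReader(file_in):
        df_new = SomeFunction(row["name"])
        writer.writerow({"start": df_new["column1"], "end": df_new["column2"]})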
There are many things we could comment on, but I understand that your concern is having to spell out the loop for a and for b separately, given that you already list them in list_files.
If this is the issue, what about doing something like this?
# CHANGED: list only the stems of the base names; we will use them for many things
file_name_stems = ["a", "b"]
# CHANGED: we save the dataframes in a dictionary
dataframes = {}
# CHANGED: did you really need the enumerate?
for file_stem in file_name_stems:
    filename = file_stem + ".csv"
    df = pd.read_csv(filename)
    # Create empty data frame to store data for each iteration below
    # CHANGED: let's use df_x as a generic name. Knowing your code, you will surely find better names
    df_x = pd.DataFrame(columns=['start', 'end'])
    for index, row in df.iterrows():
        var = df.loc[index, 'name']
        df_new = SomeFunction(var)
        # Append a new row to the empty data frame
        dicts = {'start': df_new['column1'], 'end': df_new['column2']}
        df_dicts = pd.DataFrame([dicts])
        df_x = pd.concat([df_x, df_dicts], ignore_index=True)
    # CHANGED: and now we write to the file
    df_x.to_csv(f'df_{file_stem}.csv')
    # CHANGED: and save the dataframe to a dictionary in case you need it
    dataframes[file_stem] = df_x
So, instead of listing the exact filenames, you can list the stems of their names and then compose the source filename and the output one.
Another option could be to list the source filenames and replace some part of the filename to generate the output filename:
list_files = ["a.csv", "b.csv"]
for filename in list_files:
# ...
output_file_name = filename.replace(".csv", "_df.csv")
# this produces "a_df.csv" and "b_df.csv"
Does any of this look to solve your problem? :)

Split values in CSV that look like JSON

So I have a CSV file with a column called content. However, the contents of that column look like they are based on JSON and therefore house more columns. I would like to split these contents into multiple columns, or extract the final part after "value". (The question included a picture showing an example of the file.) Any ideas how to do this? I would prefer using Python. I don't have any experience with JSON.
Using pandas you could do it in a simpler way.
EDIT updated to handle the single quotes:
import pandas as pd
import json
data = pd.read_csv('test.csv', delimiter="\n")["content"]
res = [json.loads(row.replace("'", '"')) for row in data]
result = pd.DataFrame(res)
result.head()
# Export result to CSV
result.to_csv("result.csv")
(Screenshots of the input csv and the result followed here.)
This script will create a new csv file with the 'value' added to the csv as an additional column
(make sure that the input_csv and output_csv are different filenames)
import csv
import json

input_csv = "data.csv"
output_csv = "data_updated.csv"

values = []
with open(input_csv) as f_in:
    dr = csv.DictReader(f_in)
    for row in dr:
        value = json.loads(row["content"].replace("'", '"'))["value"]
        values.append(value)

with open(input_csv) as f_in:
    with open(output_csv, "w+") as f_out:
        w = csv.writer(f_out, lineterminator="\n")
        r = csv.reader(f_in)
        all = []
        row = next(r)
        row.append("value")
        all.append(row)
        i = 0
        for row in r:
            row.append(values[i])
            all.append(row)
            i += 1
        w.writerows(all)

deleting useless columns and rows in csvfile and save using python

Here's a sample csv file;
out_gate,uless_col,in_gate,n_con
p,x,x,1
p,x,y,1
p,x,z,1
a_a,u,b,1
a_a,s,b,3
a_b,e,a,2
a_b,l,c,4
a_c,e,a,5
a_c,s,b,5
a_c,s,b,3
a_c,c,a,4
a_d,o,c,2
a_d,l,c,3
a_d,m,b,2
p,y,x,1
p,y,y,1
p,y,z,3
I want to remove the useless column (the 2nd column) and the useless rows (the first three and the last three) and save the result as a new csv file. And how can I deal with a csv file that has more than 10 useless columns and useless rows?
(assuming useless rows are located only on the top or the bottom lines not scattered in the middle)
(and I am also assuming all the rows we want to use has its first element name starting with 'a_')
Can I get a solution without using numpy or pandas as well? Thanks!
Assuming that you have one or more unwanted columns and the wanted rows start with "a_".
import csv

with open('filename.csv') as infile:
    reader = csv.reader(infile)
    header = next(reader)
    data = list(reader)

useless = set(['uless_col', 'n_con'])  # Let's say there are 2 useless columns
mask, new_header = zip(*[(i, name) for i, name in enumerate(header)
                         if name not in useless])
# (0, 2) - column mask
# ('out_gate', 'in_gate') - new column headers

new_data = [[row[i] for i in mask] for row in data]  # Remove unwanted columns
new_data = [row for row in new_data if row[0].startswith("a_")]  # Remove unwanted rows

with open('filename.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(new_header)
    writer.writerows(new_data)
You can try this:
import csv

data = list(csv.reader(open('filename.csv')))
header = [data[0][0]] + data[0][2:]
final_data = [[i[0]] + i[2:] for i in data[1:]][3:-3]

with open('filename.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows([header] + final_data)
Output:
out_gate,in_gate,n_con
a_a,b,1
a_a,b,3
a_b,a,2
a_b,c,4
a_c,a,5
a_c,b,5
a_c,b,3
a_c,a,4
a_d,c,2
a_d,c,3
a_d,b,2
The solution below uses pandas.
As the pandas DataFrame drop documentation suggests, you can do the following:
import pandas as pd

df = pd.read_csv("csv_name.csv")
df = df.drop(columns=['uless_col'])  # drop returns a new frame unless inplace=True
The above code drops columns; you can likewise drop rows by index:
df = df.drop([0, 1])
Alternatively, don't read in the column in the first place:
df = pd.read_csv("csv_name.csv",
                 usecols=["out_gate", "in_gate", "n_con"])

Python: extracting data values from one file with IDs from a second file

I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
Not sure if I am reading in the excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv as whitespace-delimited:
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
it looks like this:
>>> data_file
Values
ID
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and you use its index with loc to get the matching entries from your first dataframe. Drop the missing values with dropna():
res = data_file.loc[ID_file.index].dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
You can do it using a simple dictionary in Python. You can make a dictionary from File 1 and read the IDs from File 2. The IDs from File 2 can be checked against the dictionary and only the matching ones written to your output file. Something like this could work:
with open('data.csv', 'r') as f:
    lines = f.readlines()
# Skip the CSV header
lines = lines[1:]
table = {l.split()[0]: l.split()[1] for l in lines if len(l.strip()) != 0}

with open('id.csv', 'r') as f:
    lines = f.readlines()
# Skip the CSV header
lines = lines[1:]
matchedIDs = [(l.strip(), table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them in any format you like in a file.
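For instance, a minimal sketch that writes the pairs out as a CSV (the matched.csv filename and the header row are just assumptions):
import csv

# matchedIDs is the list of (ID, value) tuples built above;
# 'matched.csv' is a hypothetical output filename.
with open('matched.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['ID', 'Value'])  # header row
    writer.writerows(matchedIDs)      # one row per (ID, value) pair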
I'm also new to Python programming, so the code I used below might not be the most efficient. The situation I assumed is that we want to find the ids that appear in both data.csv and id.csv; there might be some ids in data.csv that are not in id.csv and vice versa.
import pandas as pd

data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')
data.ID = data['ID']
id2.ID = id2['IDs']

d = []
for row in data.ID:
    d.append(row)

f = []
for row in id2.ID:
    f.append(row)

g = []
for i in d:
    if i in f:
        g.append(i)

data = pd.read_csv('data.csv', index_col='ID')
new_data = data.loc[g, :]
new_data.to_csv('new_data.csv')
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')

How can I call columns with column names in Python

I have hundreds of csv files. They do not necessarily share the same headers, as below.
CSV1:
G,B,C,D
1,2,3,4
2,4,5,6
CSV2:
A,C,D
1,2,6
2,5,7
I'd like to call each column by its name, along these lines:
if the file has column A: select that column
else: skip to the next required column (which could be B)
and repeat the same process for each file until all the required columns have been referenced. I'd really appreciate it if you could help me do this.
Call the following function with the desired column name and it returns a list of all values belonging to that column:
import csv

file = 'c:\\temp\\test.csv'

def GetValuesFromColumn(title):
    values = []
    rownum = 0
    with open(file, 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            if rownum == 0:
                index = row.index(title)
                rownum = 1
            else:
                values.append(row[index])
    return values

values = GetValuesFromColumn('D')
Solution 1: use the csv module's DictReader.
Solution 2: if your data is really numerical, as in your example, you could use numpy.genfromtxt to produce structured arrays: http://docs.scipy.org/doc/numpy/user/basics.rec.html
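For instance, a minimal sketch of Solution 2, assuming a numeric file named CSV1.csv as in the question:
import numpy as np

# names=True takes the field names from the header row and returns a
# structured array whose columns are accessible by name.
data = np.genfromtxt('CSV1.csv', delimiter=',', names=True)
print(data['D'])  # the 'D' column, if the file has one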
If you'd like a pandas approach, here's an option. It opens each file, grabs one row, looks at the columns, and checks whether any of the column names are in the desired list. If there are any columns we want, it reads the csv into a pandas DataFrame.
example data:
import pandas as pd

df = pd.DataFrame([(2014, 30, 15), (2015, 10, 20), (2007, 5, 3)])
df.columns = ['year', 'a', 'b']
df.set_index('year', inplace=True)
df.to_csv('tst.csv')
df.columns = ['c', 'z']
df.to_csv('tst2.csv')
Do the work:
import glob

wanted = ['year', 'a', 'z']
path = '.'
allFiles = glob.glob(path + "/*.csv")
for file in allFiles:
    # grab only one row for testing
    df = pd.read_csv(file, nrows=1)
    includedCols = []
    for x in wanted:
        if x in df.columns:
            includedCols.append(x)
    if len(includedCols) > 0:
        df = pd.read_csv(file, usecols=includedCols)
        print df
        ## do something with df here
Use csv.DictReader()
Create an object which operates like a regular reader but maps the information read into a dict whose keys are given by the optional fieldnames parameter.
If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
For example
import csv

def get_values_for_column(csvfile, col):
    with open(csvfile, 'rb') as f:
        reader = csv.DictReader(f)
        values = [row[col] for row in reader]
    return values

# Usage
>>> get_values_for_column('CSV1.csv', 'D')
# returns ['4', '6'] (DictReader yields strings)
Based on the answers from both Alan and Aldervan.
