There are as many as 1440 files in one directory to be read with Python. The file names follow this pattern:
HMM_1_1_.csv
HMM_1_2_.csv
HMM_1_3_.csv
HMM_1_4_.csv
...
and for HMM_i_j_.csv, i goes from 1 to 144 and j goes from 1 to 10.
How can I import each of them into a variable named HMM_i_j similar to its original name?
For instance, HMM_140_8_.csv should be imported as variable HMM_140_8.
You can do this with pandas and a dictionary. The script below should do what you want.
To access a specific CSV file afterwards, index the dictionary by name, e.g. csv['HMM_5_7'].
import pandas as pd
csv = {}
for i in range(1, 145):
    for j in range(1, 11):
        s = 'HMM_{}_{}'.format(i, j)
        # The files on disk carry a trailing underscore: HMM_i_j_.csv
        csv[s] = pd.read_csv(s + '_.csv')
Or, shorter:
d = {}
for i in range(1440):
    s = 'HMM_{}_{}'.format(i // 10 + 1, i % 10 + 1)
    d[s] = pd.read_csv(s + '_.csv')
Or a less readable one-liner:
d = {'HMM_{}_{}'.format(i // 10 + 1, i % 10 + 1):
     pd.read_csv('HMM_{}_{}_.csv'.format(i // 10 + 1, i % 10 + 1)) for i in range(1440)}
Instead of putting them in variables with these names, you can create a dictionary where the key is the file name minus '_.csv' and the value is the content of the file.
Here are the steps; I'll let you figure out exactly how to do each one:
Create an empty dictionary
Loop i from 1 to 144 and j from 1 to 10
If the corresponding file exists, read it and put its content in the dictionary at the corresponding key
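The steps above can be sketched as follows. This is only one possible reading of them: the function name load_hmm_files and its parameters are made up for illustration, and the standard csv module is swapped in for pandas so that the sketch runs without third-party packages (with pandas installed, pd.read_csv(path) could replace the csv.reader call). The demo at the bottom creates two sample files in a temporary directory.

```python
import csv
import os
import tempfile


def load_hmm_files(directory, max_i=144, max_j=10):
    """Return {'HMM_i_j': rows} for every HMM_i_j_.csv found in directory."""
    tables = {}  # step 1: create an empty dictionary
    for i in range(1, max_i + 1):  # step 2: loop over i and j
        for j in range(1, max_j + 1):
            key = "HMM_{}_{}".format(i, j)
            path = os.path.join(directory, key + "_.csv")
            if not os.path.exists(path):
                continue  # step 3: only read files that exist
            with open(path, newline="") as handle:
                tables[key] = list(csv.reader(handle))
    return tables


# Demo with two sample files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("HMM_1_1_.csv", "HMM_140_8_.csv"):
        with open(os.path.join(tmp, name), "w", newline="") as handle:
            handle.write("a,b\n1,2\n")
    tables = load_hmm_files(tmp)

print(tables["HMM_140_8"])
```

Missing files are skipped silently here; whether that is the right behaviour depends on whether a gap in the numbering should count as an error.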
Related
I'm working on cs50's pset6, DNA, and I want to read a csv file that looks like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
But the problem is that dictionaries only have a key, and a value, so I don't know how I could structure this. What I currently have is this piece of code:
import csv
import sys
with open(sys.argv[1]) as data_file:
    data_reader = csv.DictReader(data_file)
And also, my csv file has multiple columns and rows, with a header and the first column indicating the name of the person. I don't know how to do this, and I will later need to access the individual amount of say, Alice's value of AATG.
Also, I'm using the module sys, to import DictReader and also reader
You can always write the function yourself.
Here is my code, which you can use:
def csv_to_dict(csv_file):
    # csv_file is the text content of the file, not its path
    first_newline = csv_file.index('\n')
    key_list = csv_file[:first_newline].split(',')  # header names become the keys
    info = []  # list of dictionaries, one per data line
    for line in csv_file[first_newline + 1:].split('\n'):
        if not line:
            continue  # skip blank lines, such as the one after a trailing newline
        data = {}  # dictionary for the current line
        # pair each value with the header at the same position
        for count, value in enumerate(line.split(',')):
            data[key_list[count]] = value
        info.append(data)  # store the finished dictionary in the list
    print(info)
### Be aware that this function prints a list of dictionaries rather than returning it.
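For the part of the question about looking up Alice's AATG value, here is a minimal sketch using csv.DictReader from the standard library. It assumes the first column is always called name and that every other column holds an integer count. The function name load_counts is made up for illustration, and io.StringIO stands in for the real file so the example is self-contained.

```python
import csv
import io

# Sample data from the question; io.StringIO stands in for an opened file.
sample = "name,AGATC,AATG,TATC\nAlice,2,8,3\nBob,4,1,5\nCharlie,3,2,5\n"


def load_counts(handle):
    """Map each person's name to a dict of {column name: integer count}."""
    counts = {}
    for row in csv.DictReader(handle):
        name = row.pop("name")  # assumption: the first column is called "name"
        counts[name] = {key: int(value) for key, value in row.items()}
    return counts


counts = load_counts(io.StringIO(sample))
print(counts["Alice"]["AATG"])  # 8
```

A dictionary whose values are themselves dictionaries gets around the "only a key and a value" limitation: the outer key picks the person, the inner key picks the column.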
I'm having some trouble figuring out the best implementation.
I have data in a file in this format:
|serial #|machine_name|machine_owner|
If a machine_owner has multiple machines, I'd like the machines displayed in a comma-separated list in the field, so that this:
|1234|Fred Flinstone|mach1|
|5678|Barney Rubble|mach2|
|1313|Barney Rubble|mach3|
|3838|Barney Rubble|mach4|
|1212|Betty Rubble|mach5|
Looks like this:
|Fred Flinstone|mach1|
|Barney Rubble|mach2,mach3,mach4|
|Betty Rubble|mach5|
Any hints on how to approach this would be appreciated.
You can use a dict as a temporary container to group by name and then print it in the desired format:
import re
s = """|1234|Fred Flinstone|mach1|
|5678|Barney Rubble|mach2|
|1313|Barney Rubble||mach3|
|3838|Barney Rubble||mach4|
|1212|Betty Rubble|mach5|"""
results = {}
for line in s.splitlines():
    # split on runs of one or more pipes, so doubled pipes are tolerated
    _, name, mach = re.split(r"\|+", line.strip("|"))
    if name in results:
        results[name].append(mach)
    else:
        results[name] = [mach]
for name, mach in results.items():
    print(f"|{name}|{','.join(mach)}|")
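A variation on the same grouping idea, as a minimal sketch: collections.defaultdict removes the if/else branch, and a plain split on | avoids the regular expression. The function name group_machines is made up for illustration, and the sketch assumes every line has exactly three non-empty fields.

```python
from collections import defaultdict

lines = [
    "|1234|Fred Flinstone|mach1|",
    "|5678|Barney Rubble|mach2|",
    "|1313|Barney Rubble|mach3|",
    "|3838|Barney Rubble|mach4|",
    "|1212|Betty Rubble|mach5|",
]


def group_machines(lines):
    """Group the third field of each line under the second field."""
    grouped = defaultdict(list)
    for line in lines:
        # drop the empty strings produced by leading, trailing or doubled pipes
        fields = [field for field in line.strip().split("|") if field]
        _serial, name, mach = fields  # assumption: exactly three fields
        grouped[name].append(mach)
    return grouped


grouped = group_machines(lines)
formatted = ["|{}|{}|".format(name, ",".join(machs)) for name, machs in grouped.items()]
print("\n".join(formatted))
```

Dictionaries keep insertion order in Python 3.7 and later, so the names come out in the order they first appear in the input.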
You need to store all the machine names in a list. Each time you are about to append a machine name, check that it is not already in the list so that it is not added twice.
After storing them in a list called data, iterate over the names and use this call:
data[i].append([])
to add a list after the machine name stored in the i-th place.
Once you're done, iterate over the names, find them in the file, and append the owner.
All of this can be done in 2 steps.
This is an example of what my csv file looks like with 6 columns:
0.0028,0.008,0.0014,0.008,0.0014,0.008,
I want to create 6 variables to use later in my program using these numbers as the values; however, the number of columns WILL vary depending on exactly which csv file I open.
If I were to do this manually and the number of columns was always 6, I would just create the variables like this:
thickness_0 = (row[0])
thickness_1 = (row[1])
thickness_2 = (row[2])
thickness_3 = (row[3])
thickness_4 = (row[4])
thickness_5 = (row[5])
Is there a way to create these variables with a for loop so that it is not necessary to know the number of columns? Meaning it will create the same number of variables as there are columns?
There are ways to do what you want, but it is considered very bad practice; you had better never mix your source code with data.
If your code depends on dynamic data from the outside world, use a dictionary (or a list in your case) to access your data programmatically.
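As a minimal sketch of the list approach: the row is read once and converted to a list of floats, so the number of columns never needs to be known in advance. The function name read_thicknesses is made up for illustration, and io.StringIO stands in for the real file. The sample row from the question ends with a comma, which produces an empty last field, so empty fields are skipped.

```python
import csv
import io

# Sample row from the question; note the trailing comma.
sample = "0.0028,0.008,0.0014,0.008,0.0014,0.008,\n"


def read_thicknesses(handle):
    """Return the first CSV row as a list of floats, skipping empty fields."""
    row = next(csv.reader(handle))
    return [float(value) for value in row if value.strip()]


thicknesses = read_thicknesses(io.StringIO(sample))
print(len(thicknesses))  # 6
print(thicknesses[3])    # 0.008
```

Here thicknesses[3] plays the role that thickness_3 would have played as a separate variable.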
You can use a dictionary:
import csv
mydict = {}
with open('StackupThick.csv', 'r', newline='') as infile:
    reader = csv.reader(infile, delimiter=',')
    row = next(reader)  # the file holds a single row of values
    for idx, value in enumerate(row):
        key = "thickness_" + str(idx)
        mydict[key] = value  # one entry per column
Call your values like this:
print(mydict['thickness_3'])
From your question, I understand that your csv files have only one line of comma-separated values. If so, and if you would rather not use dictionaries (as in @Mike C.'s answer), you can use globals() to add variables to the global namespace, which is itself a dict.
import csv
with open("yourfile.csv", "r", newline='') as yourfile:
    rd = csv.reader(yourfile, delimiter=',')
    row = next(rd)
    for i, j in enumerate(row):
        if j:  # a trailing comma produces an empty last field; skip it
            globals()['thickness_' + str(i)] = float(j)
Now you have whatever number of new variables called thickness_i where i is a number starting from 0.
Please be sure that all the values are in the first line of the csv file, as this code will ignore any lines beyond the first.
I have written code that imports data via xlrd into two dictionaries in Python.
Code:
import xlrd
#category.clear()
#term.clear()
book = xlrd.open_workbook("C:\Users\Koen\Google Drive\etc...etc..")
sheet = book.sheet_by_index(0)
num_rows = sheet.nrows
for i in range(1,num_rows,1):
    category = {i:( sheet.cell_value(i, 0))}
    term = {i:( sheet.cell_value(i, 1))}
When I print one of the two dictionaries (category or term), it presents me with a list of values.
print(category[i])
So far, so good.
However, when I try to access an individual value
print(category["2"])
it consistently gives me an error:
Traceback (most recent call last):
File "testfile", line 15, in <module>
print(category["2"])
KeyError: '2'
The keys are indeed numbered (as determined by i).
I've already tried [], {}, "", '', etc. Nothing works.
As I need those values later on in the code, I would like to know what the cause of the key-error is.
Thanks in advance for taking a look!
First off, you are reassigning category and term in every iteration of the for loop, so each dictionary only ever holds one key, ending with the last index: if your sheet has 100 rows, the dict will only have the key 99. To fix this, define the dictionaries outside the loop and assign the keys inside it, as follows:
category = {}
term = {}
for i in range(1, num_rows, 1):
    category[i] = (sheet.cell_value(i, 0))
    term[i] = (sheet.cell_value(i, 1))
Second, because the keys come from for i in range(1, num_rows, 1):, they are integers, so you have to access them like category[1]. To use string keys instead, convert them when storing, for example category[str(i)] = sheet.cell_value(i, 0).
I hope this clarifies the problem.
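To make the integer-versus-string point concrete, here is a small sketch that needs no spreadsheet. The list rows is made-up data standing in for the first column of the sheet, with index 0 playing the role of the header row.

```python
# Hypothetical stand-in for the first column of the sheet (index 0 is the header).
rows = ["header", "fruit", "vegetable", "grain"]

category = {}
for i in range(1, len(rows)):
    category[i] = rows[i]  # integer keys 1, 2, 3

print(category[2])      # vegetable
print("2" in category)  # False: the string "2" is a different key from the int 2

# If string keys are wanted, convert when building the dictionary.
category_by_text = {str(key): value for key, value in category.items()}
print(category_by_text["2"])  # vegetable
```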
In R there is a function called assign which assigns a value to a name in the environment.
EG:
assign("Hello", 2)
> Hello
[1] 2
In python I can't seem to do the same. I initially tried:
import numpy as np
import pandas as pd
import os
for file in os.listdir('C:\\Users\\Olivia\\Documents'):
    if file.endswith(".csv"):
        os.path.splitext(file)[0] = pd.read_csv('C:\\Users\\Olivia\\Documents\\' + file)
But I can see this is trying to make a string equal to a file which doesn't work.
I managed to get all the files in a list by doing:
import glob
dl = glob.glob(r'C:\Users\Olivia\Documents\*.csv')
nl = []
for i in dl:
    pl = i.split(os.sep)
    name = pl[5][:-4]
    nl.append(name)
ddict = {}
for k, v in zip(nl,dl):
    ddict[k] = ddict.get(k,"") + v
dfl = []
for k, v in ddict.items():
    dfl.append(pd.read_csv(v))
But now, how do I get each data frame out of the list and name it after its file, without the extension? There must be a way to assign each data frame in the list a name from the file list.
Honestly, you were on the right track with your first method. Unfortunately, Python doesn't give you a clean way to create a "variable number of variables" dynamically, as you have tried and realised already. However, you can create a dictionary and assign dataframes to string keys as you like. Here's how:
root = 'C:\\Users\\Olivia\\Documents'
ddict = {}
for file in os.listdir(root):
    if file.endswith(".csv"):
        name = os.path.splitext(file)[0]  # file name without the extension
        ddict[name] = pd.read_csv(os.path.join(root, file))
Another way of building this dictionary is to use a dict comprehension:
ddict = {os.path.splitext(file)[0]: pd.read_csv(os.path.join(root, file))
         for file in os.listdir(root) if file.endswith('.csv')}
Now, referring to a single dataframe is as easy as
ddict['your_file_name']
Another thing to note: the safest way to build file paths is os.path.join. It's safer than a plain +.
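As an aside, the standard pathlib module can do the listing, joining and extension stripping in one go. The sketch below is an alternative to os.listdir and os.path.join, not part of the answer above. The function name csv_paths_by_name is made up, and it only builds a mapping from name to path so that it runs without pandas; each path could then be passed to pd.read_csv. The demo uses a temporary directory with made-up file names.

```python
import tempfile
from pathlib import Path


def csv_paths_by_name(root):
    """Map each CSV file's name (without extension) to its full path."""
    return {path.stem: path for path in sorted(Path(root).glob("*.csv"))}


# Demo in a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ("sales.csv", "costs.csv", "notes.txt"):
        (Path(tmp) / filename).write_text("a,b\n1,2\n")
    paths = csv_paths_by_name(tmp)
    names = list(paths)

print(names)  # ['costs', 'sales']
```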
References
How do I create a variable number of variables?
why use os.path.join over string concatenation