Writing CSV files with Python with exact formatting parameters - python

I'm having trouble with processing some csv data files for a project. Someone suggested using python/csv reader to help break down the files, which I've had some success with, but not in a way I can use.
This code is a little different from what I was trying before. I am essentially attempting to create an array. In the raw data format, the first 7 rows contain no data; after that, each column contains 50 experiments, each with 4000 rows, for some 200,000 rows total. What I want to do is take each column and make it an individual csv file, with each experiment in its own column. So it would be an array of 50 columns and 4000 rows for each data type. The code here does pull out the correct values, and I think the logic is okay, but it quotes the opposite of what I want: I want the separators (the commas and spaces) without quotes and the element values in quotes. Right now it is doing just the opposite for both, element values with no quotes and the separators in quotes. I've spent several hours trying to figure out how to do this, to no avail.
import csv
ifile = open('00_follow_maverick.csv')
epistemicfile = open('00_follower_maverick_EP.csv', 'w')
reader = csv.reader(ifile)
colnum = 0
rownum = 0
y = 0
z = 8
for column in reader:
    rownum = 4000 * y + z
    for element in column:
        writer = csv.writer(epistemicfile)
        if y <= 50:
            y = y + 1
            writer.writerow([element])
            writer.writerow(',')
            rownum = x * y + z
        if y > 50:
            y = 0
            z = z + 1
            writer.writerow(' ')
            rownum = x * y + z
        if z >= 4008:
            break
What is going on: I am taking each row in the raw data file in iterations of 4000, so that I can separate them with commas for the 50 experiments. When y, the experiment indicator here, reaches 50, it resets back to experiment 0, and adds 1 to z, which tells it which row to look at, by the formula of 4000 * y + z. When it completes the rows for all 50 experiments, it is finished. The problem here is that I don't know how to get python to write the actual values in quotes, and my separators outside of quotes.
Any help will be most appreciated. Apologies if this seems a stupid question; I have no programming experience, and this is my first attempt ever. Thank you.
Sorry, I'll try to make this clearer. The original csv file has several columns, each of which is a different set of data.
A miniature example of the raw file looks like:
column1 column2 column3
exp1data1time1 exp1data2time1 exp1data3time1
exp1data1time2 exp1data2time2 exp1data3time2
exp2data1time1 exp2data2time1 exp2data3time1
exp2data1time2 exp2data2time2 exp2data3time2
exp3data1time1 exp3data2time1 exp3data3time1
exp3data1time2 exp3data2time2 exp3data3time2
So, the actual version has 4000 rows instead of 2 for each new experiment. There are 40 columns in the actual version, but basically, the data type in the raw file matches the column number. I want to separate each data type or column into an individual csv file.
This would look like:
csv file1
exp1data1time1 exp2data1time1 exp3data1time1
exp1data1time2 exp2data1time2 exp3data1time2
csv file2
exp1data2time1 exp2data2time1 exp3data2time1
exp1data2time2 exp2data2time2 exp3data2time2
csv file3
exp1data3time1 exp2data3time1 exp3data3time1
exp1data3time2 exp2data3time2 exp3data3time2
So, I'd move the raw data in the file to a new column, and each data type to its own file. Right now I'm only going to do one file, until I can move the separate experiments to separate columns in the new file. So, for the miniature example above, the 4000 in the code would become 2. I hope this makes more sense, but if not, I will try again.

If I had a cat for each time I saw a bio or psych or chem database in this state:
"each column contains 50 experiments,
each with 4000 rows, for 200000 some
rows total. What I want to do is take
each column, and make it an individual
csv file, with each experiment in its
own column. So it would be an array of
50 columns and 4000 rows for each data
type"
I'd have way too farking many cats.
I didn't even look at your code because the re-mangling you are proposing is just another problem that will have to be solved. I don't fault you, you claim to be a novice and all your peers make the same sort of error. Beginning programmers who have yet to understand how to use arrays often wind up with variable declarations like:
integer response01, response02, response03, response04, ...
and then very, very redundant code when they try to see if every response is - say - 1. I think this is such a seductive error in bio-informatics because it actually models the paper notations they come from rather well. Unfortunately, the sheet-of-paper model isn't the best way to model data.
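For instance, a single list replaces the whole pile of numbered variables (a sketch with made-up responses):

responses = [1, 1, 2, 1]                    # instead of response01, response02, ...
all_ones = all(r == 1 for r in responses)   # one expression instead of a chain of ifs
print(all_ones)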
You should read and understand why database normalization was developed, codified and has come to dominate how people think about structured data. One Wikipedia article may not be sufficient. Using the example I excerpted, let me try to explain how I think of it. Your data consists of observations; put the other way, the primary datum is a singular observation. That observation has a context though: it is one of a set of 4000 observations, where each set belongs to one of 50 experiments. If you had to attach a context to each observation you'd wind up with an addressing scheme that looks like:
<experiment_number, observation_number, value>
In database jargon, that's a tuple, and it is capable of representing, with no ambiguity and perfect symmetry, the entirety of your data. I'm not certain that I've understood the exact structure of your data, so perhaps it is something more like:
<experiment_number, protocol_number, observation_number, value>
where the protocol may be some form of variable treatment type - let's say pH. But note that I didn't call the protocol a pH and I don't record it as such in the database. What I would then need is an ancillary table showing the relevant parameters of the protocol, e.g.:
<protocol_number, acidity, temperature, pressure>
Now we've just built a "relation" that those database people like to talk about; we've also begun normalizing the data. If you need to know the pH for a given protocol, there is one and only one place to find it: the proper row of the protocol table. Note that I've divorced the data that fit so nicely together on a data-sheet: from the observation table alone I can't see the pH for a particular datum. But that's okay, because I can just look it up in my protocol table if needed. This is a "relational join", and if I needed to, I could coalesce all the various parameters from all the various tables and reconstitute the original datasheet in its original, unstructured glory.
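To make the idea concrete, a minimal sketch in plain Python, with entirely made-up numbers:

# observations: <experiment_number, protocol_number, observation_number, value>
observations = [
    (1, 1, 1, 0.42),
    (1, 1, 2, 0.38),
    (2, 2, 1, 0.51),
]

# protocol table: <protocol_number, acidity, temperature, pressure>
protocols = {
    1: (7.0, 298, 1.0),
    2: (5.5, 310, 1.0),
}

# a "relational join" in miniature: look up the pH for one particular observation
experiment, protocol, observation, value = observations[2]
acidity, temperature, pressure = protocols[protocol]
print(value, acidity)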
I hope this answer is of some use to you. I'm certain that I don't even know what field of study your data is from, but these principles apply across domains from drug trials to purchase requisition processing. Please understand that I'm trying to inform, per your request, and there is zero condescension intended. I welcome further questions on the matter.

Normalization of the dataset
Thanks for giving the example. You have the context I described already, perhaps I can make it more clear.
column1 column2 column3
exp1data1time1 exp1data2time1 exp1data3time1
exp1data1time2 exp1data2time2 exp1data3time2
The columns are an artifice made by the last guy; that is, they carry no relevant information. When parsed into a normal form, your data looks just like my first proposed tuple:
<experiment_number, time, response_number, response>
where I suspect time may actually mean "subject_id" or "trial_number". It may very well look incongruous to you to conjoin all the different response values into the same dataset; indeed, based on your desired output, I suspect that it does. At first blush the objection might run "but the subject's response to a question about epistemic properties of chairs has no connection to their meta-epistemic beliefs regarding color", but this would be mistaken. The data are related because they have a common experimental subject, and self-correlation is an important concept in sociological analytics.
For example, you may find that respondent A gives the same responses as respondent B, except all of A's responses are biased one higher because of how the subject understood the criteria. This would make a very real difference in the absolute values of the data, but I hope you can see that the question "do A and B actually have different epistemic models?" is salient and valid. One method of data modeling allows this question to be answered easily, your desired method does not.
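To make that concrete, here is a small sketch of the comparison once the data are in the normalized form; the respondent ids, questions and response values are invented for illustration:

import pandas as pd

# normalized rows: <respondent, question, response>
df = pd.DataFrame([
    ('A', 'q1', 3), ('A', 'q2', 4), ('A', 'q3', 2),
    ('B', 'q1', 2), ('B', 'q2', 3), ('B', 'q3', 1),
], columns=['respondent', 'question', 'response'])

# pivot back to one column per respondent and compare the two directly
wide = df.pivot(index='question', columns='respondent', values='response')
print((wide['A'] - wide['B']).describe())   # a constant difference of 1 in this toy case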
Working parsing code to follow shortly.

The normalizing code
#!/usr/bin/env python3
"""parses a csv file containing a particular data layout and normalizes it

The raw data set is a csv file of the form::

    column1 column2 column3
    exp01data01time01 exp01data02time01 exp01data03time01
    exp01data01time02 exp01data02time02 exp01data03time02

where there are 40 such columns and the literal column title
is added as context to the output row.

It is assumed that the columns are comma separated but
the lexical form of the subcolumns is unspecified.

Output will consist of a single CSV output stream
on stdout of the form::

    exp01,data01,time01,column1

for varying actual values of each field.
"""
import csv
import sys


def split_subfields(s):
    """returns a list of subfields of s

    this function is expected to be re-written to match the actual,
    unspecified lexical structure of s."""
    return [s[0:5], s[5:11], s[11:17]]


def normalise_data(reader, writer):
    """writes one normalized output row per (data row, column) pair"""
    # obtain the headings for use in normalization
    names = next(reader)
    # get the data rows, split them out by column, add the column name
    for row in reader:
        for column, datum in enumerate(row):
            fields = split_subfields(datum)
            fields.append(names[column])
            writer.writerow(fields)


def main():
    if len(sys.argv) != 2:
        print('usage: %s input.csv' % sys.argv[0], file=sys.stderr)
        sys.exit(1)
    in_file = sys.argv[1]
    reader = csv.reader(open(in_file))
    writer = csv.writer(sys.stdout)
    normalise_data(reader, writer)


if __name__ == '__main__':
    main()
Such that the command python epistem.py raw_data.csv > cooked_data.csv yields excerpted output looking like:
exp01,data01,time01,column1
...
exp01,data40,time01,column40
exp01,data01,time02,column1
exp01,data01,time03,column1
...
exp02,data40,time15,column40
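One last note on the quoting problem from the original question: the csv module controls this through the quoting argument of csv.writer, so there is no need to write separators by hand. A minimal sketch:

import csv
import sys

# QUOTE_ALL puts every field in quotes; the commas between fields stay unquoted
writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)
writer.writerow(['exp01', 'data01', 'time01', 'column1'])
# output: "exp01","data01","time01","column1"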

Related

How to maintain incrementing values in a column after removing rows from a data frame

I work with a test system that outputs a large CSV matrix of values which I then process using the Pandas module in Python. The parameters that system uses when testing a given part are governed by a predetermined sequence. A simplified example is shown here:
Raw data frame
However, not all of these steps are desired in the output data. In fact, the rows containing a 'Clock Frequency' value of '3.0MHz' are only included to act as buffer points to allow a climate chamber to reach the intended temperature. I do not wish to include data collected at these parameters in my results.
I found I was pretty easily able to remove these rows from my data frame by using the below code. Note that in this example I am working with a Pandas data frame called 'csvDF'.
tempBuffers = csvDF[csvDF['Clock Frequency']==3e6].index
csvDF.drop(tempBuffers, inplace=True)
This produces the following output:
Data frame with buffer steps removed
The issue with this is that now my 'Sequence Step' column is wrong. I want the data table to appear as if those buffer steps never existed. The sequence steps should be sequential for all non-buffer steps. The desired output is shown below:
Data frame with buffer steps removed and corrected sequence step column
What code do I need in order to achieve this?
You can try something like this:
import pandas as pd

n = 3  # number of rows in each sequence step
csvDF.reset_index(inplace=True, drop=True)
csvDF['Sequence Step'] = pd.Series(range(len(csvDF)))
csvDF['Sequence Step'] = csvDF['Sequence Step'].apply(lambda x: int(x / n))
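Here is a self-contained run of the same renumbering on a made-up frame; the column names follow the question, and the integer-division shortcut on the reset index is just one way to express the int(x / n) step:

import pandas as pd

# toy frame: three sequence steps of three rows each; step 1 is the 3.0 MHz buffer
csvDF = pd.DataFrame({
    'Sequence Step': [0, 0, 0, 1, 1, 1, 2, 2, 2],
    'Clock Frequency': [1e6] * 3 + [3e6] * 3 + [2e6] * 3,
})

# drop the buffer rows, then renumber the remaining steps from zero
csvDF = csvDF[csvDF['Clock Frequency'] != 3e6].reset_index(drop=True)
n = 3  # rows per sequence step
csvDF['Sequence Step'] = csvDF.index // n
print(csvDF['Sequence Step'].tolist())  # [0, 0, 0, 1, 1, 1]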

How can I periodically skip rows reading txt with pandas?

I need to process data measured every 20 seconds throughout the whole of 2018; the raw file has the following structure:
date time a lot of trash
in several rows
amount of samples trash again
data
date time a lot of trash
etc.
I want to make one pandas dataframe out of it, or at least one dataframe per block of data (each block's size is encoded as "amount of samples"), while keeping the time of measurement.
How can I ignore all the other trash? I know that it appears periodically (period = amount of samples), but:
- I don't know how many rows are in the file
- I don't want to call file.getline() explicitly in a loop, because it would take forever (especially in Python) and I don't have enough computing power for that
Is there any method to skip rows periodically in pandas or another lib? Or how else can I solve this?
There is an example of my data:
https://drive.google.com/file/d/1OefLwpTaytL7L3WFqtnxg0mDXAljc56p/view?usp=sharing
I want to get a dataframe similar to the data table in the picture, plus an additional column with the date-time, and without the technical rows.
Use itertools.islice, where N below means read every Nth line:
from itertools import islice
import pandas as pd

N = 3
sep = ','
with open(file_path, 'r') as f:
    lines_gen = islice(f, None, None, N)
    df = pd.DataFrame([x.strip().split(sep) for x in lines_gen])
I repeated your data three times. It sounds like you need every 4th row (not starting at 0), because that is where your data lies. The documentation for skiprows says:
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
So what if we pass a not in check to the lambda function? That is what I am doing below.
I am creating a list of the row indexes I want to keep and passing the not in check to the skiprows argument. In English: skip all the rows that are not every 4th line.
import pandas as pd
# creating a list of all the 4th row indexes. If you need more than 1 million, just up the range number
list_of_rows_to_keep = list(range(0,1000000))[3::4]
# passing this list to the lambda function using not in.
df = pd.read_csv(r'PATH_To_CSV.csv', skiprows=lambda x: x not in list_of_rows_to_keep)
df.head()
#output
0 data
1 data
2 data
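If building a list of a million indexes feels wasteful, the callable can also test the index arithmetically; this is a small variation on the same skiprows idea:

import pandas as pd

# keep only every 4th line (indexes 3, 7, 11, ...) and skip everything else
df = pd.read_csv(r'PATH_To_CSV.csv', skiprows=lambda x: x % 4 != 3)
df.head()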
Just count how many lines are in the file and pass the list of those that should be skipped (call it useless_rows) to pandas.read_csv(..., skiprows=useless_rows).
My problem was counting the rows cheaply.
There are a few ways to do it:
On Linux, the command "wc -l" (here is how to call it from your code: Running "wc -l <filename>" within Python Code)
Generators. I have a key in my relevant rows: it is in the last column. Not really informative, but a rescue for me. So I can count the rows with it; it turns out to be about 500000 lines and it took 0.00011 to count:
with open(filename) as f:
    for row in f:
        if '2147483647' in row:
            continue
        yield row
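Putting the pieces together, a sketch of the whole skiprows-list approach; the every-4th-line period is carried over from the answer above as an assumption, so adjust it to the real block size:

import pandas as pd

# count the lines without loading the whole file into memory
with open(filename) as f:
    n_lines = sum(1 for _ in f)

# build the list of technical rows to skip (here: everything that is not a 4th line)
useless_rows = [i for i in range(n_lines) if i % 4 != 3]

df = pd.read_csv(filename, skiprows=useless_rows, header=None)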

the replacement of converted columns after downcasting doesn't end

I'm working on my first correlation analysis. I received the data in an Excel file, imported it as a DataFrame (I had to pivot it), and now I have a set of almost 3000 rows and 25000 columns. I can't choose a subset from it, as every column is important for this project, and I also don't know what information each column stores in order to choose the most interesting ones, because everything is encoded with integer numbers (it is a university project). It is like a big questionnaire, where every person has his/her own row and the answers to every question are stored in a different column.
I really need to solve this issue because later I'll have to replace the many NaNs with the medians of the columns and then start the correlation analysis. I tried that part first and it failed because of the size, which is why I tried downcasting first.
The dataset is 600 MB; I used the downcasting code for the floats and saved 300 MB, but when I try to replace the new columns in a copy of my dataset, it runs for 30 minutes and doesn't finish. No warning, no error until I interrupt the kernel, and it still gives me no hint as to why it doesn't work.
I can't simply drop the NaNs first, because there are so many that it would erase almost everything.
# i've got this code from https://www.dataquest.io/blog/pandas-big-data/
def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

gl_float = myset.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric, downcast='float')

print(mem_usage(gl_float))         # almost 600
print(mem_usage(converted_float))  # almost 300

optimized_gl = myset.copy()
optimized_gl[converted_float.columns] = converted_float  # this doesn't end
Once the replacement works, I want to use the Imputer function for the NaN replacement and print the correlation result for my dataset.
In the end I decided to use this:
column1 = myset.iloc[:,0]
converted_float.insert(loc=0, column='ids', value=column1)
instead of the lines with optimized_gl, and it solved the problem, but only because every column changed except for the first one, so I just had to add the first column back to the others.
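For reference, a compact sketch of that final approach; it assumes, as above, that only the first column holds the ids and everything else is float:

import pandas as pd

# downcast every float column to the smallest float dtype that holds it
gl_float = myset.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric, downcast='float')

# re-attach the untouched id column instead of writing back into a full copy
converted_float.insert(loc=0, column='ids', value=myset.iloc[:, 0])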

Is there a more efficient tool than iterrows() in this situation?

Okay so, here's the thing. I'm working with a lot of pandas data frames and arrays. Often times, I need to pair up a value from one frame with a value from another, ideally combining the information into one frame in the end.
Say I'm looking at image files. There's a set of information specific to each file. Sometimes there are certain types of image files that share the same kind of information. Simple example:
FILEPATH, TYPE, COLOR, VALUE_I
/img2.jpg, A, 'green', 0.6294
/img45.jpg, B, 'green', 0.1846
/img87.jpg, A, 'blue', 34.78
Often, this information is indexed out by type/color/value etc and fed into some other function that gives me another important output, let's say VALUE_II. But I can't concatenate it directly onto the original dataframe because the indices won't match, either because of the nature of the output or because I only fed part of the frame.
Or another situation: I learn that images of a certain TYPE have a specific value attached to them, so I make a dictionary of types and their value. Again, this column doesn't exist, so in this case I would use iterrows() to march down the frame, see if the type matches a specific key, and if it does append it to an array. Then in the end, I convert that array to a dataframe and concatenate it onto the original.
Here's the worst offender. With up to 1800 rows in each frame, it takes FOREVER:
newColumn = []
for index, row in originalDataframe.iterrows():
    for indx, rw in otherDataframe.iterrows():
        if row['filename'] in rw['filepath']:
            newColumn.append([rw['VALUE_I'], rw['VALUE_II'], rw['VALUE_III']])
newColumn = pd.DataFrame(newColumn, columns=['VALUE_I', 'VALUE_II', 'VALUE_III'])
originalDataframe = pd.concat([originalDataframe, newColumn], axis=1)
Solutions would be appreciated!
If you can split the filename out of otherDataframe["filepath"], you can then just compare it for equality with originalDataframe's filename, without needing the in check. After that you can simplify the calculation with pandas.DataFrame.join, which for each filename in originalDataframe will find the same filename in otherDataframe and add all of its other columns.
import os
otherDataframe["filename"] = otherDataframe["filepath"].map(os.path.basename)
joinedDataframe = originalDataframe.join(otherDataframe.set_index("filename"), on="filename")
If there are columns with the same name in originalDataframe and otherDataframe you should set lsuffix or rsuffix.
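A toy run of that join, with invented file names and values, just to show the shape of the result:

import os
import pandas as pd

originalDataframe = pd.DataFrame({'filename': ['img2.jpg', 'img87.jpg']})
otherDataframe = pd.DataFrame({
    'filepath': ['/data/img2.jpg', '/data/img87.jpg'],
    'VALUE_I': [0.6294, 34.78],
})

otherDataframe['filename'] = otherDataframe['filepath'].map(os.path.basename)
joinedDataframe = originalDataframe.join(otherDataframe.set_index('filename'), on='filename')
print(joinedDataframe)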
Focusing on the second half of your question, as that's what you provided code for: your program is checking every row of df1 against every row of df2, yielding potentially 1800 * 1800, or 3,240,000, possible combinations. If there is only one possible match for each row, then adding a 'break' will help some, but it is not ideal.
newColumn.append([rw['VALUE_I'],rw['VALUE_II'], rw['VALUE_III']])
break
If the structure of your data allows it, I would try something like:
ref = {}
for i, path in enumerate(otherDataframe['filepath']):
    *_, file = path.split('\\')
    ref[file] = i
originalDataframe['VALUE_I'] = None
originalDataframe['VALUE_II'] = None
originalDataframe['VALUE_III'] = None
for i, file in enumerate(originalDataframe['filename']):
    try:
        j = ref[file]
        originalDataframe.loc[i, 'VALUE_I'] = otherDataframe.loc[j, 'VALUE_I']
        originalDataframe.loc[i, 'VALUE_II'] = otherDataframe.loc[j, 'VALUE_II']
        originalDataframe.loc[i, 'VALUE_III'] = otherDataframe.loc[j, 'VALUE_III']
    except:
        pass
Here we iterate through the paths in otherDataframe (I assume they follow a pattern of C:\asdf\asdf\file), split the path on \ to pull out file, and then construct a dictionary of files to row numbers. Next we initialize the 3 columns in originalDataframe that you want to write to.
Lastly we iterate through the files in originalDataframe, check to see if that file exists in our dictionary of files in otherDataframe (done inside a try to catch errors), and pull the row number (out of the dictionary) which we then use to write the values from other to original.
Side note: you describe your paths as being in the vein of 'C:/asd/fdg/img2.jpg', in which case you should use:
*_, file = path.split('/')
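If the file names match one-to-one, another option in the same spirit is to build the lookup with pandas itself and map it column by column instead of looping; this sketch reuses the column names from above:

import os
import pandas as pd

# index otherDataframe by bare file name, then map each value column across
lookup = otherDataframe.set_index(otherDataframe['filepath'].map(os.path.basename))
for col in ['VALUE_I', 'VALUE_II', 'VALUE_III']:
    originalDataframe[col] = originalDataframe['filename'].map(lookup[col])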

pandas groupby is returning two groups for the same unique id

I have a large pandas dataframe on which I am running group-by operations.
CHROM POS Data01 Data02 ......
1 ....................
1 ...................
2 ..................
2 ............
scaf_9 .............
scaf_9 ............
So, I am doing:
my_data_grouped = my_data.groupby('CHROM')
for chr_, data in my_data_grouped:
    # do something with chr_
    # write something from that chr_ data
Everything is fine with small data and with data where there is no string-type CHROM, i.e. scaf_9. But with very large data that includes scaf_9, I am getting two groups for the CHROM value 2. There is no error message and it is not affecting the computation; the issue is that when I write the data out by group, I get two groups for 2 (split unequally).
It is becoming very hard for me to trace the origin of this problem, since there is no error message and it works well with small data. My only assumptions are:
Is there a certain limit on the number of lines in the total vs. grouped dataframe that the pandas module can handle? What is the fix to this problem?
Are most of the 2 values treated as integer objects and some (the later part, close to scaf_9) as string objects? Is this possible?
Sorry, I am only making assumptions here, and it has become impossible for me to pin down the origin of the problem.
Post Edit:
I have also tried running sort_by(['CHROM']) before doing the groupby, but the problem still persists.
Any possible fix to the issue?
Thanks,
In my opinion there is a data problem, obviously some whitespace, so pandas processes each group separately.
The solution should be to remove trailing whitespace first:
df.index = df.index.astype(str).str.strip()
You can also check the unique string values of the index:
a = df.index[df.index.map(type) == str].unique().tolist()
If the first column is not the index:
df['CHROM'] = df['CHROM'].astype(str).str.strip()
a = df.loc[df['CHROM'].map(type) == str, 'CHROM'].unique().tolist()
EDIT:
The final solution was simpler: casting to str, like:
df['CHROM'] = df['CHROM'].astype(str)
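A tiny reproduction of why mixed types split a group; the values are made up, and the CHROM value 2 appears once as an integer and once as a string:

import pandas as pd

df = pd.DataFrame({'CHROM': [2, '2', 'scaf_9'], 'POS': [10, 20, 30]})
print(df.groupby('CHROM', sort=False).size())   # int 2 and str '2' form two separate groups

df['CHROM'] = df['CHROM'].astype(str)
print(df.groupby('CHROM', sort=False).size())   # now a single group for '2'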
