Writing data from one CSV file to another CSV file using Python

I have a CSV file with two rows of data. The first row contains names like Red, Green, Orange, Purple, which repeat in that pattern. The second row holds the data. The format is what I have written below, but in a CSV file. I want to take this and put the values into separate columns, as shown in Table 2, again in a CSV file. How do I combine the repeated names while keeping all of their data? I understand I could write them out like this
lista1=["Red", "Green", "Orange", "Purple"]
lista2=[3,56,23,12,34,65,98,7,9,45,33,15]
and work with those, but I have hundreds of files like this and can't change the numbers and titles by hand each time.
Table 1:
Red  Green  Orange  Purple  Red  Green  Orange  Purple  Red  Green  Orange  Purple
3    56     23      12      34   65     98      7       9    45     33      15
Table 2 (output):
Red  Green  Orange  Purple
3    56     23      12
34   65     98      7
9    45     33      15
Again, the Table 1 data comes from a CSV file, and I want the desired output in a CSV file as well.

Since you do not need pandas in your solution, here is one that uses only the csv module.
I read the file using the csv.reader() function, converted the data into a dictionary according to the sample input CSV file you provided, and then wrote that dictionary out as a CSV file.
Here is the sample CSV input file:
Red,Green,Orange,Purple,Red,Green,Orange,Purple,Red,Green,Orange,Purple
3,56,23,12,34,65,98,7,9,45,33,15
Now the code:
import csv

# read both rows: the first holds the names, the second the values
with open('try.csv') as csvfile:
    mixedData = csv.reader(csvfile)
    column, data = mixedData

# group each value under its name, preserving first-seen order
data_dict = {}
for i, name in enumerate(column):
    if name in data_dict:
        data_dict[name].append(data[i])
    else:
        data_dict[name] = [data[i]]

# write the names as the header and the grouped values as rows
with open("try_output.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(data_dict.keys())
    writer.writerows(zip(*data_dict.values()))
Output file:
Red,Green,Orange,Purple
3,56,23,12
34,65,98,7
9,45,33,15
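Since you mention having hundreds of such files, here is a minimal sketch of the same idea in a loop (the glob pattern and the output naming are my assumptions; adjust them to your layout):
import csv
from glob import glob

for path in glob("data/*.csv"):  # hypothetical location/pattern
    with open(path) as csvfile:
        column, data = csv.reader(csvfile)  # exactly two rows: names, values

    grouped = {}
    for name, value in zip(column, data):
        grouped.setdefault(name, []).append(value)

    # e.g. data/foo.csv -> data/foo_output.csv
    with open(path.replace(".csv", "_output.csv"), "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(grouped.keys())
        writer.writerows(zip(*grouped.values()))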

This is how I would do it. This assumes that the first row is always going to be words of some kind, and the bottom will always be numbers. As long as that's true, you don't need to know what the words are ahead of time.
First, read the data from the csv file. (I'm not reading it directly into a dataframe because column names need to be unique.)
>>> import pandas as pd
>>> import re
>>> infile = '/path/to/sample.csv'
>>> f = open(infile, 'r')
>>> text = f.read()
>>> print(text)
Red,Green,Orange,Purple,Red,Green,Orange,Purple,Red,Green,Orange,Purple
3,56,23,12,34,65,98,7,9,45,33,15
Then separate out your words and numbers, using regex:
>>> words = re.findall("[a-zA-Z]+", text)
>>> numbers = re.findall("[0-9]+", text)
>>> print(words)
['Red', 'Green', 'Orange', 'Purple', 'Red', 'Green', 'Orange', 'Purple', 'Red', 'Green', 'Orange', 'Purple']
>>> print(numbers)
['3', '56', '23', '12', '34', '65', '98', '7', '9', '45', '33', '15']
Create your dataframe:
>>> df = pd.DataFrame({
... "Words": words,
... "Numbers": numbers
... })
>>> print(df)
     Words  Numbers
0      Red        3
1    Green       56
2   Orange       23
3   Purple       12
4      Red       34
5    Green       65
6   Orange       98
7   Purple        7
8      Red        9
9    Green       45
10  Orange       33
11  Purple       15
Group the words together. (This seems like a convoluted way of doing it, but I couldn't figure out a simpler one.)
>>> words_no_repeats = list(set(words))
>>> new_df = pd.DataFrame()
>>> for w in words_no_repeats:
... values = df[df['Words']==w]['Numbers'].to_list()
... temp_df = pd.DataFrame({w: values}, index=range(len(values)))
... new_df = pd.concat([new_df, temp_df], axis=1)
>>> print(new_df)
  Orange Green Red Purple
0     23    56   3     12
1     98    65  34      7
2     33    45   9     15
Then save your new dataframe as a csv:
>>> new_df.to_csv('/path/to/new_sample.csv', index=False)
This is what the csv file looks like:
Orange,Green,Red,Purple
23,56,3,12
98,65,34,7
33,45,9,15
I know you said in your comments that you were trying to avoid Pandas, but I don't know of any other way to do the grouping.
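For what it's worth, a possibly simpler grouping, sketched under the assumption that every word occurs the same number of times (as in your sample), is a dict comprehension over groupby:
# one column per distinct word, values in order of appearance
new_df = pd.DataFrame({w: g['Numbers'].to_list() for w, g in df.groupby('Words')})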

Related

How to extract text from a column in pandas

I have a column in a pandas df that has this format: "1_A01_1_1_NA". I want to extract the text between the underscores, e.g. "A01", "1", "1" and "NA". I tried to use left, right and mid, but the problem is that at some point the column value changes into something like 11_B40_11_8_NA.
PS: the df has 7510 rows.
Use str.split:
import pandas as pd

df = pd.DataFrame({'Col1': ['1_A01_1_1_NA', '11_B40_11_8_NA']})
out = df['Col1'].str.split('_', expand=True)
Output:
>>> out
    0    1   2  3   4
0   1  A01   1  1  NA
1  11  B40  11  8  NA
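If you want meaningful headers instead of 0-4, you can rename the result afterwards and optionally attach it back to the original frame (the column names below are made up for illustration):
out.columns = ['id1', 'code', 'id2', 'id3', 'flag']  # hypothetical names
df = df.join(out)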
The function you are looking for is pandas.Series.str.split().
You should be able to take your nasty column as a Series and use the str.split("_", expand=True) method. The "expand" keyword is exactly what you need to make new columns out of the results (splitting on the "_" character, not at any specific index).
So, something like this:
First we need to create a little bit of nonsense like yours.
(Please forgive my messy and meandering code, I'm still new)
import pandas as pd
from random import choice
import string
# Creating Nonsense Data Frame
def make_nonsense_codes():
    """
    Returns a string of nonsense like '11_B40_11_8_NA'
    """
    nonsense = "_".join(
        [
            "".join(choice(string.digits) for i in range(2)),
            "".join(
                [choice(string.ascii_uppercase),
                 "".join([choice(string.digits) for i in range(2)])]
            ),
            "".join(choice(string.digits) for i in range(2)),
            choice(string.digits),
            "NA"
        ]
    )
    return nonsense

my_nonsense_df = pd.DataFrame(
    {"Nonsense": [make_nonsense_codes() for i in range(5)]}
)
print(my_nonsense_df)
# Nonsense
# 0 25_S91_13_1_NA
# 1 80_O54_58_4_NA
# 2 01_N98_68_3_NA
# 3 88_B37_14_9_NA
# 4 62_N65_73_7_NA
Now we can select our "Nonsense" column, and use str.split().
# Wrangling the nonsense column with series.str.split()
wrangled_nonsense_df = my_nonsense_df["Nonsense"].str.split("_", expand = True)
print(wrangled_nonsense_df)
# 0 1 2 3 4
# 0 25 S91 13 1 NA
# 1 80 O54 58 4 NA
# 2 01 N98 68 3 NA
# 3 88 B37 14 9 NA
# 4 62 N65 73 7 NA

I have 25 .csv files (each file is a scribe), all with the same structure (X, Y and STATUS). I want to combine all of them into one large .txt file.

So I have tried this and got all the files (25 scribe files) combined into one. Each scribe contains 3330 ID numbers, and there are X and Y coordinates to highlight the number of defects (STATUS) for each ID number. I want to know the total sum of STATUS for each ID number across all the files combined.
import os
import pandas as pd
from glob import glob
stock_files = sorted(glob('*AVI.als'))
dfList = []
stock_files
df = pd.concat((pd.read_csv(file).assign(filename = file) for file in stock_files), ignore_index = True)
X\tY\tSTATUS filename
0 14\t1\t0 2008-09728-AVI.als
1 15\t1\t0 2008-09728-AVI.als
2 16\t1\t0 2008-09728-AVI.als
3 17\t1\t0 2008-09728-AVI.als
4 18\t1\t0 2008-09728-AVI.als
... ... ...
83245 30\t90\t0 2008-13754-AVI.als
83246 31\t90\t0 2008-13754-AVI.als
83247 32\t90\t0 2008-13754-AVI.als
83248 33\t90\t0 2008-13754-AVI.als
83249 34\t90\t0 2008-13754-AVI.als
After combining all the CSV files into one .txt file, I would expect to see a result like this:
X Y STATUS
0 14 1 0
1 15 1 0
2 16 1 0
3 17 1 0
4 18 1 0
...
3330
Any help is much appreciated
I think you simply need to add the separator (sep=r"\t"):
df = pd.concat([pd.read_csv(file, sep=r"\t").assign(filename = file) for file in stock_files], ignore_index = True)
You can simply save to .txt like this:
df.to_csv("output.txt")
If you want the sum of STATUS for each ID (X) you can do this:
df.groupby(["X"])["STATUS"].sum()

Filling in missing data in Python

I was hoping you would be able to help me solve a small problem.
I am using a small device that prints out two properties that I save to a file. The device rasters in X and Y direction to form a grid. I am interested in plotting the relative intensity of these two properties as a function of the X and Y dimensions. I record the data in 4 columns that are comma separated (X, Y, property 1, property 2).
The grid is examined in lines, so for each Y value, it will move from X1 to X2 which are separated several millimeters apart. Then it will move to the next line and over again.
I am able to process the data in python with pandas/numpy but it doesn't work too well when there are any missing rows (which unfortunately does happen).
I have attached a sample of the output (and annotated the problems):
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
44,12,50,5
45,12,100,6
46,12,1500,7
47,12,2500,8
Sometimes, however, a line or a few will be missing, making it impossible to process and plot. Currently I have not been able to fix this automatically and have to do it manually. The bad output looks like this:
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
45,12,100,5 << missing 44,12...
46,12,1500,6
47,12,2500,7
I know the number of lines I expect since I know my range of X and Y.
What would be the best way to deal with this? Currently I manually enter the missing X and Y values and populate property 1 and 2 with values of 0. This can be time consuming and I would like to automate it. I have two questions.
Question 1: How can I automatically fill in my missing data with the corresponding values of X and Y and two zeros? This could be obtained from a pre-generated array of X and Y values that correspond to the experimental range.
Question 2: Is there a better way to split the file into separate arrays for plotting (rather than using the 'New' line?) For instance, by having a 'if' function that will output each line between X(start) and X(end) to a separate array? I've tried doing that but with no success.
I've attached my current (crude) code:
df = pd.read_csv('FileName.csv', delimiter=',', skiprows=0)
rows = [-1] + np.where(df['X'] == 'New')[0].tolist() + [len(df.index)]
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r + 1: rows[i + 1]]
maxY = len(dff)
data = []
data2 = []
for yaxes in range(0, maxY):
    data2.append(dff[yaxes].ix[:, 2])
<data2 is then used for plotting using matplotlib>
To answer my Question 1, I was thinking about using the 'reindex' and 'reset_index' functions, but I haven't managed to make them work.
I would appreciate any suggestions.
Does this meet what you want?
Q1: fill X using reindex, and the other columns using fillna.
Q2: passing each separated chunk to read_csv via StringIO is easier (drop the unicode() call if you use Python 3).
# read file and split the input
f = open('temp.csv', 'r')
chunks = f.read().split('New')

# read csv as separated dataframes, using first column as index
dfs = [pd.read_csv(StringIO(unicode(chunk)), header=None, index_col=0) for chunk in chunks]

def pad(df):
    # reindex; you should know the range of x
    df = df.reindex(np.arange(44, 48))
    # pad y backward / forward, assuming y has a single value per chunk
    df[1] = df[1].fillna(method='bfill')
    df[1] = df[1].fillna(method='ffill')
    # pad the remaining columns with zeros
    df = df.fillna(0)
    # revert index to values
    return df.reset_index(drop=False)

dfs = [pad(df) for df in dfs]
dfs[0]
#     0   1    2  3
# 0  44  11  500  1
# 1  45  11  120  2
# 2  46  11  320  3
# 3  47  11  700  4

# dfs[1]
#     0   1     2  3
# 0  44  12     0  0
# 1  45  12   100  5
# 2  46  12  1500  6
# 3  47  12  2500  7
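If you then want one frame back instead of a list, you can concatenate the padded pieces, e.g.:
# stack the padded chunks and renumber the rows
full = pd.concat(dfs, ignore_index=True)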
First Question
I've included print statements inside the function to explain how it works.
In [89]:
def replace_missing(df, Ids):
    # check which values are missing
    missing = np.setdiff1d(Ids, df[0])
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        #print('---missing df---')
        #print(missing_df)
        missing_df[0] = missing
        #print('---missing df---')
        #print(missing_df)
        missing_df[1].replace(0, df[1].iloc[0], inplace=True)
        #print('---missing df---')
        #print(missing_df)
        df = pd.concat([df, missing_df])
        #print('---final df---')
        #print(df)
    return df
In [91]:
Ids = np.arange(44, 48)
final_df = df1.groupby(df1[1], as_index=False).apply(replace_missing, Ids).reset_index(drop=True)
final_df
Out[91]:
    0   1     2  3
0  44  11   500  1
1  45  11   120  2
2  46  11   320  3
3  47  11   700  4
4  45  12   100  5
5  46  12  1500  6
6  47  12  2500  7
7  44  12     0  0
Second Question
In [92]:
group = final_df.groupby(final_df[1])
In [99]:
separate = [group.get_group(key) for key in group.groups.keys()]
separate[0]
Out[104]:
    0   1    2  3
0  44  11  500  1
1  45  11  120  2
2  46  11  320  3
3  47  11  700  4
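As an aside, you can also iterate over the groups directly without building the intermediate list, which is handy if each Y value is plotted separately:
for y, chunk in final_df.groupby(final_df[1]):
    # chunk is the sub-frame for this Y; column 2 holds the property values
    print(y, chunk[2].to_list())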

Splitting Text File - Column to Rows in Python

I have a txt file which looks like:
X Y Z I
1 1 1 10
2 2 2 20
3 3 3 30
4 4 4 40
5 5 5 50
6 6 6 60
7 7 7 70
8 8 8 80
9 9 9 90
I want to split the 4th column into rows of 3 values and export it to a txt file.
10 20 30
40 50 60
70 80 90
This is just an example. In my real data I have to split a column with 675311 values into 16471 rows of 41 values each, so the first 41 values in column "I" will become the first row.
If you use numpy, this is trivial and potentially more flexible:
Edit: added parameters for selecting which column to pick and how many columns the output table will have. You can change it to fit whatever shape you want the output to be.
import numpy as np

datacolumn = 3
outputcolumns = 3
data = np.genfromtxt('path/to/csvfile', skip_header=1)
column = data[:, datacolumn]
reshaped = column.reshape((len(column) // outputcolumns, outputcolumns))
np.savetxt('path/to/newfile', reshaped)
Edit: separated out comments from code for readability. Here's what each line does:
# Parse CSV file with header
# Extract 4th column
# Reshape column into new matrix
# Save matrix to text file
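As an aside, you can also let numpy infer the number of rows, which sidesteps the integer-division pitfall entirely:
# -1 tells numpy to infer the row count from the column length
reshaped = column.reshape(-1, outputcolumns)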
with open('in.txt', 'r') as f:
    f.next()  # skip header
    l = [x.split()[-1] for x in f]

print [l[x:x+3] for x in xrange(0, len(l), 3)]
[['10', '20', '30'], ['40', '50', '60'], ['70', '80', '90']]
What I did: first I build a list of all the numbers you want to write to the text file. Then, in a second loop with the output file open, I iterate over that list by index, because every third value needs a new line. A local variable j holds i + 1, which I use to check whether i + 1 is a multiple of 3 (since the loop starts at 0, every third iteration makes i + 1 a multiple of 3). If it is, I write a newline character; otherwise I write a space and continue.
nums = []
with open('input.txt', 'r') as f:
    for line in f:
        s = line.split(' ')
        num = s[3]
        nums.append(num)

with open('output.txt', 'w') as f:
    for i in range(0, len(nums)):
        num = nums[i].strip('\n')
        f.write(num)
        j = i + 1
        if j % 3 == 0:
            f.write('\n')
        else:
            f.write(' ')
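The same reshaping can be written more compactly once nums is built; a sketch:
# strip stray newlines, then join the values in groups of three
cleaned = [n.strip() for n in nums]
rows = (' '.join(cleaned[i:i + 3]) for i in range(0, len(cleaned), 3))
with open('output.txt', 'w') as f:
    f.write('\n'.join(rows) + '\n')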

Convert excel or csv file to pandas multilevel dataframe

I've been given a reasonably large Excel file (5k rows), also available as a CSV, that I would like to turn into a pandas multilevel DataFrame. The file is structured like this:
SampleID    OtherInfo    Measurements  Error    Notes
sample1     stuff                               more stuff
                         36            6
                         26            7
                         37            8
sample2     newstuff                            lots of stuff
                         25            6
                         27            7
where the number of measurements is variable (and sometimes zero). There is no fully blank row between any of the entries, and the 'Measurements' and 'Error' columns are empty on rows that have the other (string) data, which might make it harder to parse(?). Is there an easy way to automate this conversion? My initial idea is to parse the file with Python first and then feed the pieces into DataFrame slots in a loop, but I don't know exactly how to implement that, or whether it is even the best course of action.
Thanks in advance!
Looks like your file has fixed width columns, for which read_fwf() can be used.
In [145]: data = """\
SampleID    OtherInfo    Measurements  Error    Notes
sample1     stuff                               more stuff
                         36            6
                         26            7
                         37            8
sample2     newstuff                            lots of stuff
                         25            6
                         27            7
"""
In [146]: df = pandas.read_fwf(StringIO(data), widths=[12, 13, 14, 9, 15])
OK, now we have the data; with just a little extra work you have a frame on which you can use set_index() to create a MultiIndex.
In [147]: df[['Measurements', 'Error']] = df[['Measurements', 'Error']].shift(-1)
In [148]: df[['SampleID', 'OtherInfo', 'Notes']] = df[['SampleID', 'OtherInfo', 'Notes']].fillna(method='ffill')
In [150]: df = df.dropna()
In [151]: df
Out[151]:
SampleID OtherInfo Measurements Error Notes
0 sample1 stuff 36 6 more stuff
1 sample1 stuff 26 7 more stuff
2 sample1 stuff 37 8 more stuff
4 sample2 newstuff 25 6 lots of stuff
5 sample2 newstuff 27 7 lots of stuff
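From here, the MultiIndex the question asks about is one more step; a minimal sketch (the choice of index columns is my assumption):
# build a two-level index from the identifying columns
df = df.set_index(['SampleID', 'OtherInfo'])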
This will at least clean it up for additional processing.
import csv

reader = csv.reader(open(<csv_file_name>))
data = []
keys = next(reader)  # header row
last_info = {}       # identifying fields from the most recent info-only row
for row in reader:
    r = dict(zip(keys, row))
    if not r['Measurements'] or not r['Error']:
        # info-only row: remember its string fields for the rows below it
        last_info = {k: r[k] for k in ['SampleID', 'OtherInfo', 'Notes'] if r[k]}
        continue
    # measurement row: copy the identifying fields down
    for key in ['SampleID', 'OtherInfo', 'Notes']:
        if not r[key]:
            r[key] = last_info.get(key, '')
    data.append(r)
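To write the cleaned rows back out, a minimal sketch using csv.DictWriter (the output filename is hypothetical; keys and data come from the loop above):
import csv

with open('cleaned.csv', 'w', newline='') as out:  # hypothetical output name
    writer = csv.DictWriter(out, fieldnames=keys)
    writer.writeheader()
    writer.writerows(data)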
