I have a CSV file like this:
F1,F2,F3,F4,F5,F6,F7,F8,F9,F10,label
a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,L1
b1,b2,b3,b4,b5,b6,b7,b8,b9,b10,L2
I want to generate the column combinations, each together with the label, and write them to files as follows:
- For 1-column combinations with the label, write the expected results to 10 files:
file1:
F1,label
a1,L1
b1,L2
file2:
F2,label
a2,L1
b2,L2
until
file10:
F10,label
a10,L1
b10,L2
- For 2-column combinations with the label, write the expected results to 45 files:
2C_file1:
F1,F2,label
a1,a2,L1
b1,b2,L2
2C_file2:
F1,F3,label
a1,a3,L1
b1,b3,L2
until
2C_file45:
F9,F10,label
a9,a10,L1
b9,b10,L2
- For 3-column combinations with the label, write to 120 files:
.....until.....
- For 9-column combinations with the label, write to 10 files:
- For the 10-column combination with the label, write to 1 file:
I have searched and found Python code for string combinations with itertools.
How could I achieve the above tasks with Python code?
import itertools as iters

text = 'ABCDEFGHIJ'
C1 = iters.combinations(text, 1)
print(list(C1))
C2 = iters.combinations(text, 2)
print(list(C2))
.....
C9 = iters.combinations(text, 9)
print(list(C9))
C10 = iters.combinations(text, 10)
print(list(C10))
This loop structure should be able to create the structure you would like to have; it only has to be changed to write to files instead of printing. The sequence length and the start position together select the slice of each row that is printed:
#!/usr/bin/env python
row1 = ["F{i}".format(i=i) for i in range(1,11)]
row1.append("label")
row2 = ["a{i}".format(i=i) for i in range(1,11)]
row2.append("L1")
row3 = ["b{i}".format(i=i) for i in range(1,11)]
row3.append("L2")
for SequenceLength in range(1, len(row1)):
    for SequencePositionStart in range(len(row1)):
        if row1[SequencePositionStart:SequenceLength] == []:
            continue
        print(','.join(row1[SequencePositionStart:SequenceLength]), row1[-1], sep=",")
        print(','.join(row2[SequencePositionStart:SequenceLength]), row2[-1], sep=",")
        print(','.join(row3[SequencePositionStart:SequenceLength]), row3[-1], sep=",")
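Note that the loop above enumerates contiguous slices rather than all column combinations. For the combinations the question asks for, itertools.combinations can drive the file writing directly. A minimal sketch, assuming the data sits in a file named data.csv and that pandas is acceptable; the input name and the .csv suffix on the output names are assumptions:
import itertools
import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical input file name
features = [c for c in df.columns if c != 'label']

for r in range(1, len(features) + 1):
    # combinations() yields (F1, F2), (F1, F3), ... in the question's order
    for n, combo in enumerate(itertools.combinations(features, r), start=1):
        out = df[list(combo) + ['label']]
        out.to_csv('{}C_file{}.csv'.format(r, n), index=False)
For 10 feature columns this writes 10 + 45 + 120 + ... + 10 + 1 = 1023 files, matching the counts in the question.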
I have a dataframe, df, with 43244 rows, and a txt file, text, with 1107957 lines. The purpose of the following code is to look up each entry of df in the text and, where it is present, record the corresponding word_id value.
with open('text.txt') as f:
    text = f.readlines()

for index, row in df.iterrows():
    lemma_id = 0
    for lines in range(len(text)):
        word_row = text[lines].split()
        if word_row[2] == row['Word']:
            word_id = word_row[1]
            row['ID'] = word_id
However, this code would take an estimated 120 days to complete in my Jupyter notebook, and I (obviously) want it to execute a bit more efficiently.
How do I approach this? Should I convert text into a dataframe/database, or is there another more efficient approach?
EDIT
Example of dataframe structure:
Word ID
0 hello NaN
1 there NaN
Example of txt.file structure:
NR ID WORD
32224 86289 ah
32225 86290 general
32226 86291 kenobi
Have you tried using pandas.merge?
Your for loop would be replaced by the following (assuming that text has been read into a DataFrame, text_df, with columns NR, ID and WORD):
new_df = pd.merge(df.drop(columns=['ID']), text_df, left_on='Word', right_on='WORD')
Dropping df's placeholder ID column first avoids an ID_x/ID_y suffix clash, and the inner merge already discards words that never occur in the text, so no dropna is needed.
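To get text into a DataFrame in the first place, pandas can parse the whitespace-separated structure shown in the edit; a minimal sketch, with column names taken from the sample header:
import pandas as pd

# The file is whitespace-separated with a header row: NR ID WORD
text_df = pd.read_csv('text.txt', sep=r'\s+')

new_df = pd.merge(df.drop(columns=['ID']), text_df,
                  left_on='Word', right_on='WORD')
A hash-based merge like this scales roughly linearly in the two table sizes, so it should finish in seconds rather than the estimated 120 days.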
I'm trying to read a .txt file and output the count of each letter, which works; however, I'm having trouble exporting that data to .csv in a specific way.
A snippet of the code:
freqs = {}

with open(Book1) as f:
    for line in f:
        for char in line:
            if char in freqs:
                freqs[char] += 1
            else:
                freqs[char] = 1

print(freqs)
And for the exporting to csv, I did the following:
test = {'Book 1 Output':[freqs]}
df = pd.DataFrame(test, columns=['Book 1 Output'])
df.to_csv(r'book_export.csv', sep=',')
Currently, when I run it, the export puts the entire dictionary into a single cell. I want the output to have each letter on its own row instead, split at the ':' and ',' into separate columns.
I've tried various other answers on here, but most of them end up giving ValueErrors, so maybe I just don't know how to apply them, like the following one:
df[[',']] = df[','].str.split(expand=True)
Use DataFrame.from_dict with DataFrame.rename_axis to set the index name; then the csv looks like what you need:
import pandas as pd

# sample data
freqs = {'a': 5, 'b': 2}

df = (pd.DataFrame.from_dict(freqs, orient='index', columns=['Book 1 Output'])
        .rename_axis('Letter'))
print(df)
        Book 1 Output
Letter
a                   5
b                   2
df.to_csv(r'book_export.csv', sep=',')
An alternative is to use a Series:
s = pd.Series(freqs, name='Book 1 Output').rename_axis('Letter')
print(s)
Letter
a    5
b    2
Name: Book 1 Output, dtype: int64
s.to_csv(r'book_export.csv', sep=',')
EDIT:
If there are multiple frequency dicts, change the DataFrame constructor:
freqs = {'a': 5, 'b': 2}
freqs1 = {'a': 9, 'b': 3}

df = pd.DataFrame({'f1': freqs, 'f2': freqs1}).rename_axis('Letter')
print(df)
        f1  f2
Letter
a        5   9
b        2   3
I have a dataframe of file names and their paths as continuous strings, e.g.:
files = pandas.DataFrame((
        name       path
0  file1.txt  \\drive\folder1\folder2\folder3\...\file1.txt
1  file2.pdf  \\drive\folder1\file2.pdf
2  file3.xls  \\drive\folder1\folder2\folder3\...\folder21\file3.xls
n        ...  ...))
The size of the frame is about 1.02E+06 entries; the depth of the drive is at most 21 folders, but varies greatly.
The goal is to have a dataframe in the format of:
name level1 level2 level3 level4 ... level21
0 file.txt folder1 folder2 folder3 0 ... 0
1 file.pdf folder1 0 0 0 ... 0
2 file3.xls folder1 folder2 folder3 folder4 ... folder21
...
I split the file-location string and created an array, which can be padded with zeros if the path is shorter than the maximum depth:
def path_split(name):
    return np.array(os.path.normpath(name).split(os.sep)[7:])

files = files.assign(plist=files['path'].iloc[:].apply(path_split))
Then I add a column with the number of folders in each file's path:
files = files.assign(len_plist=files.plist.iloc[:].map(len))
The problem here is that the split path strings create nested arrays within the dataframe.
Next, an empty DataFrame is created with as many columns as the maximum number of folders (21 here) and one row per file (1.02E+06 here):
max_folder = files['len_plist'].max()  # the maximum number of folders
levelcos = ['flevel_{}'.format(i) for i in np.arange(max_folder)]
levels = pd.DataFrame(np.zeros((files.shape[0], max_folder)),
                      columns=levelcos, index=files.index)
and now I fill the empty frame with the entries of the path arrays:
def fill_rows(df, array):
    for i, row in enumerate(array):
        df.iloc[i, :row.shape[0] - 1] = row[:-1]
    return df

levels = fill_rows(levels, files.plist.values)
This takes a lot of time, since the varying lengths of the path arrays do not allow a vectorized solution right away. Looping over all 1.02E+06 rows of the dataframe would take at least 34 hours, maybe up to 200 hours.
First and foremost, I want to optimize the filling of the dataframe; in a second step I would split the dataframe, parallelize the operations, and assemble the frame again afterwards.
edit: added the clarification that a shorter path can be filled up to the maximum length with zeros.
Maybe I'm missing something, but why doesn't this work for you?
expanded = files['path'].str.split(os.path.sep, expand=True).fillna(0)
expanded = expanded.rename(columns=lambda x: 'level_' + str(x))
df = pd.concat([files['name'], expanded], axis=1)
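As a quick sanity check, here is the same idea on a two-row toy frame; splitting on an explicit backslash instead of os.path.sep keeps the sketch platform-independent, since the UNC paths in the question use backslashes regardless of the OS running the script:
import pandas as pd

files = pd.DataFrame({
    'name': ['file1.txt', 'file2.pdf'],
    'path': [r'\\drive\folder1\folder2\file1.txt',
             r'\\drive\folder1\file2.pdf'],
})

# The leading '\\' produces two empty fields (level_0, level_1),
# which can be dropped afterwards if they get in the way.
expanded = files['path'].str.split('\\', expand=True).fillna(0)
expanded = expanded.rename(columns=lambda x: 'level_' + str(x))
print(pd.concat([files['name'], expanded], axis=1))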
I have 3 datasets which contain the flow in m3/s per location. Dataset 1 is a 5-year ARI flood, dataset 2 is a 20-year ARI flood and dataset 3 is a 50-year ARI flood.
Per location I found the maximum discharge (5, 20 & 50):
Code:
for key in Data_5_ARI_RunID_Flow_New.keys():
    m = key
    y5F_RunID = Data_5_ARI_RunID_Flow_New.loc[:, m]
    y20F_RunID = Data_20_ARI_RunID_Flow_New.loc[:, m]
    y50F_RunID = Data_50_ARI_RunID_Flow_New.loc[:, m]
    max_y5F = max(y5F_RunID)
    max_y20F = max(y20F_RunID)
    max_y50F = max(y50F_RunID)
    Max_DataID = m, max_y5F, max_y20F, max_y50F
    print(Max_DataID)
The output is like this:
('G60_18', 44.0514, 47.625, 56.1275)
('Area5_11', 1028.4065, 1191.5946, 1475.9685)
('Area5_12', 1017.8286, 1139.2628, 1424.4304)
('Area5_13', 994.5626, 1220.0084, 1501.1483)
('Area5_14', 995.9636, 1191.8066, 1517.4541)
Now I want to export this result to a csv file, but I don't know how. I used this line of code, but it didn't work:
Max_DataID.to_csv(r'C:\Users\Max_DataID.csv', sep=',', index = False)
Use the file name myexample.csv together with the specific path where you want to create the file.
Please check that Max_DataID is an iterable value. As your output shows, the values are tuples, so I use list() to convert each tuple into a list, which is a supported value for csv's writerow:
import csv

with open('myexample.csv', 'w', newline='') as file:
    filewriter = csv.writer(file, delimiter=',')
    for data in Max_DataID:
        filewriter.writerow(list(data))
You can do the following.
df.to_csv(file_name, sep='\t')
Also, if you want to split it into chunks, like 10,000 rows, or whatever, you can do this.
import pandas as pd

for i, chunk in enumerate(pd.read_csv('C:/your_path_here/main.csv', chunksize=10000)):
    chunk.to_csv('chunk{}.csv'.format(i))
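Note that the loop in the question reassigns Max_DataID on every iteration, so by the time you export, only the last tuple is left. A sketch that collects all rows first and then writes them once with pandas (the column names here are illustrative):
import pandas as pd

rows = []
for m in Data_5_ARI_RunID_Flow_New.keys():
    # Per location: maximum discharge of the 5, 20 and 50 year floods
    rows.append((m,
                 Data_5_ARI_RunID_Flow_New.loc[:, m].max(),
                 Data_20_ARI_RunID_Flow_New.loc[:, m].max(),
                 Data_50_ARI_RunID_Flow_New.loc[:, m].max()))

result = pd.DataFrame(rows, columns=['Location', 'max_5y', 'max_20y', 'max_50y'])
result.to_csv(r'C:\Users\Max_DataID.csv', index=False)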
I have a CSV file that looks like this:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
...
0101,41.0
0102,39.9
0103,44.6
0104,42.0
0105,43.0
0106,42.4
It's a list of temperatures for specific dates. It contains data for several years, so the same dates occur multiple times. I would like to average the temperatures so that I get a new table where each date occurs only once, with the average temperature for that date in the second column.
I know that Stack Overflow requires you to include what you've attempted, but I really don't know how to do this and couldn't find any other answers on this.
I hope someone can help. Any help is much appreciated.
You can use pandas and run the groupby command, where df is your data frame:
df.groupby('DATE').mean()
Here is a toy example to depict the behaviour:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 1, 2, 3], "b": [1, 2, 3, 4, 5, 6]})
df.groupby('a').mean()
This will result in
     b
a
1  2.5
2  3.5
3  4.5
When the original dataframe was
   a  b
0  1  1
1  2  2
2  3  3
3  1  4
4  2  5
5  3  6
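Applied to the question's data, that is essentially the whole job, including writing the result back out; a sketch assuming the input file is called temps.csv. Reading DATE as a string matters, because values like 0101 would otherwise lose their leading zero when parsed as integers:
import pandas as pd

df = pd.read_csv('temps.csv', dtype={'DATE': str})  # keep leading zeros in dates
df.groupby('DATE').mean().to_csv('averages.csv')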
If you can use defaultdict from the collections module, it makes this type of thing pretty easy.
Assuming your list is in the same directory as the Python script and looks like this:
list.csv:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
Here is the code I used to print out the averages.
# test.py
# usage: python test.py list.csv
import sys
from collections import defaultdict

# Open the file given as the first command-line argument
with open(sys.argv[1]) as File:
    # Skip the header line of the file ("DATE,TEMP")
    next(File)
    # Create a dictionary of lists
    ourDict = defaultdict(list)
    # Parse the file, line by line
    for each in File:
        # Split the line on the comma
        # (Comma Separated Values = CSV)
        each = each.split(',')
        # Now each[0] is a date and each[1] is a value.
        # Use each[0] as the key and append the value to the list.
        ourDict[each[0]].append(float(each[1]))

print("Date\tValue")
for key, value in ourDict.items():
    # The average is the sum of the list's values
    # divided by the list's length
    print(key, '\t', sum(value) / len(value))