I want to split a text file into multiple files by a column - python

I have a file whose first column contains repeated values, as below:
1999.2222 50 100
1999.2222 42 15
1999.2222 24 35
1999.2644 10 25
1999.2644 10 26
1999.3564 65 98
1999.3564 45 685
1999.3564 54 78
1999.3564 78 98
and I want to split this file into three files, as follows:
file1:
1999.2222 50 100
1999.2222 42 15
1999.2222 24 35
file2:
1999.2644 10 25
1999.2644 10 26
file3:
1999.3564 65 98
1999.3564 45 685
1999.3564 54 78
1999.3564 78 98
How could I split like this? Thanks:)

itertools.groupby is probably the most suitable choice for what you're after.
import itertools
with open('file.txt', 'r') as fin:
    # group each line in input file by first part of split
    for i, (k, g) in enumerate(itertools.groupby(fin, lambda l: l.split()[0]), 1):
        # create file to write to suffixed with group number - start = 1
        with open('file{0}.txt'.format(i), 'w') as fout:
            # for each line in group write it to file
            for line in g:
                fout.write(line.strip() + '\n')
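One caveat: itertools.groupby only groups consecutive lines, so it relies on equal first-column values being adjacent, as they are in the sample. If the input were not ordered that way, a dictionary of lists is one possible alternative. A minimal sketch, where split_by_first_column, the sample rows and the temporary output directory are all made up for illustration:

```python
import os
import tempfile
from collections import defaultdict

def split_by_first_column(lines):
    """Group lines by their first whitespace-separated field (hypothetical helper)."""
    groups = defaultdict(list)  # dicts keep insertion order in Python 3.7+
    for line in lines:
        if line.strip():  # skip blank lines
            groups[line.split()[0]].append(line.strip())
    return groups

# Made-up rows where equal keys are NOT adjacent.
sample = [
    "1999.2222 50 100",
    "1999.2644 10 25",
    "1999.2222 42 15",
]
groups = split_by_first_column(sample)

out_dir = tempfile.mkdtemp()  # stand-in output directory for this sketch
for i, (key, rows) in enumerate(groups.items(), 1):
    path = os.path.join(out_dir, "file{0}.txt".format(i))
    with open(path, "w") as fout:
        fout.write("\n".join(rows) + "\n")
```

This holds the whole input in memory, which the groupby version avoids, so it is only a reasonable trade when the file is small or unsorted.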

Related

How to use numpy random choice to get progressively longer sequences with the same numbers?

What I tried was this:
import numpy as np
def test_random(nr_selections, n, prob):
    selected = np.random.choice(n, size=nr_selections, replace=False, p=prob)
    print(str(nr_selections) + ': ' + str(selected))
n = 100
prob = np.random.choice(100, n)
prob = prob / np.sum(prob)  # only for demonstration purpose
for i in np.arange(10, 100, 10):
    np.random.seed(123)
    test_random(i, n, prob)
The result was:
10: [68 32 25 54 72 45 96 67 49 40]
20: [68 32 25 54 72 45 96 67 49 40 36 74 46 7 21 20 53 65 89 77]
30: [68 32 25 54 72 45 96 67 49 40 36 74 46 7 21 20 53 62 86 60 35 37 8 48
52 47 31 92 95 56]
40: ...
Contrary to my expectation and hope, the 30 numbers selected do not contain all of the 20 numbers. I also tried using numpy.random.default_rng, but only strayed further away from my desired output. I also simplified the original problem somewhat in the above example. Any help would be greatly appreciated. Thank you!
Edit for clarification: I do not want to generate all the sequences in one loop (like in the example above) but rather use the related sequences in different runs of the same program. (Ideally, without storing them somewhere)
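One possible approach (a sketch, not a tested answer to the full original problem): instead of calling np.random.choice once per size, derive a single seeded weighted ordering and take its first k entries. Sorting exponential draws divided by the weights yields an ordering that, as far as I know, is distributed like successive weighted draws without replacement. With a fixed seed every run reproduces the same ordering, so each longer sample contains the shorter ones. The helper name nested_weighted_sample and the example weights are assumptions, and the sketch assumes prob is identical in every run.

```python
import numpy as np

def nested_weighted_sample(k, prob, seed=123):
    """Return the first k items of a seeded weighted ordering (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    prob = np.asarray(prob, dtype=float)
    # Exponential "arrival times" with rate prob[i]: a smaller time means picked earlier.
    # Items with zero probability get an infinite time and sort last.
    with np.errstate(divide="ignore"):
        keys = rng.exponential(size=prob.size) / prob
    order = np.argsort(keys, kind="stable")
    return order[:k]

# Example weights, made up for this sketch.
prob = np.arange(1, 101, dtype=float)
prob /= prob.sum()

for k in (10, 20, 30):
    print(k, nested_weighted_sample(k, prob))
```

Because the ordering depends only on the seed and on prob, nothing needs to be stored between runs; asking for a larger k simply reads further along the same ordering.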

Writing a dict of large dataframes to excel

I am creating dicts where the dict keys are strings and the values are large-ish pandas DataFrames. I would like to write these dicts to an Excel file, but the issue I'm having is that when Python writes a DataFrame to a CSV it cuts out parts. Code:
import pandas as pd
import numpy as np
def create_random_df():
    return pd.DataFrame(np.random.randint(0, 100, size=(70, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
dic = {'key1': create_random_df(), 'key2': create_random_df()}
with open('test.csv', 'w') as f:
    for key in dic.keys():
        f.write("%s,%s\n" % (key, dic[key]))
This sort of outputs the format I'd like except for the following:
All of the dataframe columns are in Cell B1 and they're not complete... it's
A B C D E F G H I ... R S T U V W X Y
Z
and then the index and DataFrame elements are all in column A, i.e. cells A2:A4 are
0 55 96 60 47 11 3 2 69 50 ... 3 23 26 3 15 53
78 95 49
1 72 48 12 25 32 57 11 84 5 ... 11 43 56 0 68 55
95 64 84
2 80 56 78 58 79 72 67 97 58 ... 84 34 18 21 71 20
72 36 37
I'd like the DataFrames to be written to the CSV in their entirety, with the values in discrete cells.
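Formatting a DataFrame with %s writes its truncated console representation rather than its data. You can try to_csv instead:
dic = {'key1': create_random_df(), 'key2': create_random_df()}
with open('test.csv', 'w') as f:
    for key in dic.keys():
        df = dic[key]
        # put the key in the first row of a new leading column, blank elsewhere
        df.insert(0, 'Key', pd.Series([key]))
        df.Key = df.Key.fillna('')
        f.write(df.to_csv(index=False))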
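Another option, sketched here under the assumption that one stacked table is acceptable, is to let pd.concat combine the dict and call to_csv once. The helper name write_frames_to_csv, the index level names "Key" and "Row", and the temporary output path are hypothetical choices for this example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def write_frames_to_csv(frames, path):
    """Stack a dict of DataFrames and write them as one CSV (hypothetical helper)."""
    # pd.concat on a dict uses the dict keys as an extra outer index level.
    combined = pd.concat(frames, names=["Key", "Row"])
    combined.to_csv(path)
    return combined

rng = np.random.default_rng(0)
columns = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
frames = {
    "key1": pd.DataFrame(rng.integers(0, 100, size=(70, 26)), columns=columns),
    "key2": pd.DataFrame(rng.integers(0, 100, size=(70, 26)), columns=columns),
}

out_path = os.path.join(tempfile.mkdtemp(), "test.csv")  # stand-in location
combined = write_frames_to_csv(frames, out_path)
print(combined.shape)
```

Every row then carries its key in the first column, and the original DataFrames are not modified. If a real .xlsx file is wanted rather than a CSV, DataFrame.to_excel with a pd.ExcelWriter can write each DataFrame to its own sheet, provided an Excel engine such as openpyxl is installed.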

Python. Trying to print list but it's only printing directory structure

Hi, when I try to print a list, it prints out the directory and not the contents of win.txt. I'm trying to enumerate the lines of the text file, split each one and append it to a, then do other things once I can get a to print. What am I doing wrong?
import os
win_path = os.path.join(home_dir, 'win.txt')
def roundedStr(num):
    return str(int(round(num)))
a=[] # i declares outside the loop for recover later
for i,line in enumerate(win_path):
    # files are iterable
    if i==0:
        t=line.split(' ')
    else:
        t=line.split(' ')
        t[1:6]= map(int,t[1:6])
    a.append(t) ## a have all the data
a.pop(0)
print a
It prints out the directory, for example c:\workspace\win.txt, which is NOT what I want.
I want it to print the contents of win.txt, taking t[1:6] as integers, like
11 21 31 41 59 21
and print them out the same way.
win.txt contains this
05/06/2017 11 21 31 41 59 21 3
05/03/2017 17 18 49 59 66 9 2
04/29/2017 22 23 24 45 62 5 2
04/26/2017 01 15 18 26 51 26 4
04/22/2017 21 39 41 48 63 6 3
04/19/2017 01 19 37 40 52 15 3
04/15/2017 05 22 26 45 61 13 3
04/12/2017 08 14 61 63 68 24 2
04/08/2017 23 36 51 53 60 15 2
04/05/2017 08 20 46 53 54 13 2
I just want [1]-[6]
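The loop in your code iterates over win_path, which is just the path string, so enumerate yields its characters one at a time rather than the lines of the file. I think what you want is to open the file 'win.txt' and read its content. Use the open function to create a file object, and a with block to scope it. See my example below. This will read the file and take the first 6 numbers of each line.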
import os
win_path = os.path.join(home_dir, 'win.txt')
a = []  # declared outside the loop so it can be used later
with open(win_path, 'r') as file:
    for i, line in enumerate(file):
        line = line.strip()
        print(line)
        if i == 0:
            t = line.split(' ')
        else:
            t = line.split(' ')
            t[1:7] = map(int, t[1:7])
        t = t[1:7]
        a.append(t)  # a has all the data
a.pop(0)  # drop the first row, as in the question's code
print(a)
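For comparison, here is a compact, self-contained sketch of the same parsing step. It uses io.StringIO with two rows copied from the question so it runs without win.txt, and it assumes every line is data (no header row to drop). The helper name read_numbers is made up for this example.

```python
import io

# Two sample rows copied from the question; a real script would open win.txt instead.
sample = io.StringIO(
    "05/06/2017 11 21 31 41 59 21 3\n"
    "05/03/2017 17 18 49 59 66 9 2\n"
)

def read_numbers(lines):
    """Return fields 1..6 of every non-empty line as integers (hypothetical helper)."""
    rows = []
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if fields:
            rows.append([int(x) for x in fields[1:7]])
    return rows

rows = read_numbers(sample)
for row in rows:
    print(" ".join(str(n) for n in row))
```

Passing an open file object to read_numbers in place of the StringIO would work the same way, since both are iterated line by line.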

Word is not defined

In this program, I am trying to write the index out to a text file named "index.txt", along with printing it out. However, whenever I run the program, I get an error saying "word" is not defined, and my index.txt file only contains Word/tLine Numbers.
Code:
from string import punctuation
def makeIndex(filename):
    wordIndex = {}
    with open(filename) as f:
        lineNum = 1
        for line in f:
            words = line.lower().split()
            for word in words:
                for char in punctuation:
                    word = word.replace(char, '')
                if word.isalpha():
                    if word in wordIndex.keys():
                        if lineNum not in wordIndex[word]:
                            wordIndex[word].append(lineNum)
                    else:
                        wordIndex[word] = [lineNum]
            lineNum += 1
    return wordIndex
def output(wordIndex):
    print("Word\tLine Numbers")
    for key in sorted(wordIndex.keys()):
        print(key, '\t', end=" ")
        for lineNum in wordIndex[key]:
            print(lineNum, end=" ")
        print()
def main():
    filename = input("What is the file name to be indexed?")
    index = makeIndex(filename)
    output(index)
    with open('index.txt', 'w') as writefile:
        writefile.write("Word/tLine Numbers")
        print('t', end= "")
        for index in range(len(word)):
            print(word[index])
            writefile.write(word[index] + '/n')
main()
Output:
What is the file name to be indexed?test.txt
Word Line Numbers
a 8 12 38 70 78
all 85 101
also 91
an 34 96
anagrams 93 104
as 84
ask 28
blocks 4
called 61
create 69
different 59
difficulties 47
each 74
employed 65
figure 32
file 9
find 100
finds 92
following 22
for 18 73
given 37
has 80
have 56
here 66
in 7 48
interesting 19
is 52 67
it 103
its 42 87
jumble 25
large 3
letters 43
long 54
many 58
new 14
of 5 16 41 45 86 102
one 44
opens 10
out 33
permutations 62 88
possibilities 17
problem 51
program 23 90
programs 20
puzzles 26
range 15
reorderings 60
same 82
scrambled 39
set 40
signature 72 83
since 94
so 57 76
solver 30
solves 24
solving 49
strategy 64
text 6
that 53 77
the 21 29 46 63 81
this 50 89
to 31 68
typing 95
unique 71
unknown 35
unscrambled 97
up 11
which 27
whole 13
will 99
with 2
word 36 75 79 98
words 55
working 1
tTraceback (most recent call last):
File "C:\Users\jp19p_000\Desktop\wordIndex(1).py", line 46, in <module>
main()
File "C:\Users\jp19p_000\Desktop\wordIndex(1).py", line 41, in main
for index in range(len(word)):
NameError: name 'word' is not defined
This is the index.txt file:
Word/tLine Numbers
The NameError is raised because word is only a local variable inside makeIndex; main never defines it. Note also that the escape sequences are '\t' and '\n', not '/t' and '/n'. A reworked version:
from collections import defaultdict
import string
import sys
# convert to lowercase, remove all digits and punctuation
trans = str.maketrans(string.ascii_uppercase, string.ascii_lowercase, string.digits + string.punctuation)
def get_unique_words(s, trans=trans):
    return set(s.translate(trans).split())
def make_index(seq, start=1):
    index = defaultdict(list)
    for i, s in enumerate(seq, start):
        for word in get_unique_words(s):
            index[word].append(i)
    return index
def write_index(index, file=sys.stdout):
    print("Word\tLines", file=file)
    for word in sorted(index.keys()):
        lines = " ".join(str(i) for i in index[word])
        print("{}\t{}".format(word, lines), file=file)
def main():
    fname = input("What is the name of the file to be indexed? ")
    with open(fname) as inf:
        index = make_index(inf)
    with open("index.txt", "w") as outf:
        write_index(index, outf)
if __name__ == "__main__":
    main()

How to write values to a csv file from another csv file

In index.csv, the fourth column has ten numbers ranging from 1 to 5. Each number can be regarded as an index, and each index corresponds to an array of numbers in filename.csv.
The row number of filename.csv represents the index, and each row has three numbers. My question is about using a nested loop to transfer the numbers in filename.csv to index.csv.
from numpy import genfromtxt
import numpy as np
import csv
data1 = genfromtxt('filename.csv', delimiter=',')
data2 = genfromtxt('index.csv', delimiter=',')
f = open('index.csv','wb')
write = csv.writer(f, delimiter=',',quoting=csv.QUOTE_ALL)
for row in data2:
for ch_row in data1:
if ( data2[row,3] == ch_row ):
write.writerow(data1[data2[row,3],:])
For example, the fourth column of index.csv contains 1,2,5,3,4,1,4,5,2,3 and filename.csv contains:
# filename.csv
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
What I need is to write the indexed row from filename.csv to index.csv and store these numbers in the 5th, 6th and 7th columns:
# index.csv
# 4 5 6 7
... 1 20 30 50
... 2 70 60 45
... 5 13 08 55
... 3 35 26 77
... 4 93 37 68
... 1 20 30 50
... 4 93 37 68
... 5 13 08 55
... 2 70 60 45
... 3 35 26 77
Can anyone help me solve this problem?
You need to indent your last 2 lines. Also, it looks like you are writing to the file from which you are reading.
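To illustrate the second point, here is a minimal sketch that reads both inputs first and writes the combined rows to a separate output, using only the csv module. The in-memory strings stand in for filename.csv and index.csv, the comma delimiter follows the question's genfromtxt calls, and the first three columns ("a,b,c") are placeholders because the question elides them.

```python
import csv
import io

# In-memory stand-ins for filename.csv and index.csv.
lookup_text = "20,30,50\n70,60,45\n35,26,77\n93,37,68\n13,08,55\n"
index_text = "a,b,c,1\na,b,c,2\na,b,c,5\n"  # columns 1-3 are hypothetical placeholders

# Load the lookup table once; list position 0 corresponds to index 1.
lookup_rows = list(csv.reader(io.StringIO(lookup_text)))

out = io.StringIO()  # a real script would open a new output file here
writer = csv.writer(out, lineterminator="\n")
for row in csv.reader(io.StringIO(index_text)):
    idx = int(row[3])  # 1-based row number into filename.csv
    writer.writerow(row + lookup_rows[idx - 1])

print(out.getvalue())
```

A direct list lookup replaces the nested loop, and writing to a different file means index.csv is never truncated while it is still needed as input.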
