How to extract data recursively on Linux? - python

I'm attempting to work with a large dataset; however, the data has been split up into hundreds of directories.
data/:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v w x y z
data/0:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/1:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/2:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/3:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/4:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/5:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/6:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/7:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/8:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/9:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
data/a:
0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s symbols t u v w x y z
Furthermore, the file encodings are also inconsistent:
0: UTF-8 Unicode text
1: UTF-8 Unicode text
2: UTF-8 Unicode text
3: UTF-8 Unicode text
4: UTF-8 Unicode text
5: Non-ISO extended-ASCII text, with LF, NEL line terminators
6: UTF-8 Unicode text
7: UTF-8 Unicode text
8: UTF-8 Unicode text
9: UTF-8 Unicode text
a: UTF-8 Unicode text
...
z: UTF-8 Unicode text
The files contain data in an email:password format.
How can I get all of the content into a JSON file, or CSV file?
I'm looking to import the data to MongoDB.
Thanks.

I'm sure someone will help you better than I can, but if I can point you in the right direction I will.
Have you tried writing a Perl script? E.g.
opendir(DIR, ".");
@files = grep(/\.cnf$/, readdir(DIR));
closedir(DIR);
foreach $file (@files) {
    # shove into a JSON file
}
Something like that?

The question was tagged with python, so I would recommend os.walk() (documentation) for recursively reading files. Something like:
import os

# path is the path to the data
for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = os.path.join(subdir, file)
        try:
            read_file(file_path)  # This is where you read the file and push to mongo etc.
        except Exception:
            continue
For the second part about reading Non-ISO extended-ASCII English text, there are some answers that might be helpful here: File encoding from English text to UTF-8
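If it helps, here is a minimal sketch of how the pieces could fit together (the data directory name, the output file combos.csv, and the latin-1 fallback are assumptions on my part, not something from your setup): walk the tree, try UTF-8 first and fall back to a more permissive codec for the Non-ISO files, split each email:password line, and write everything to one CSV.

import csv
import os

# Sketch only: flatten every file under data/ into a single CSV of
# email,password rows (assumed paths and encodings).
with open("combos.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["email", "password"])
    for subdir, dirs, files in os.walk("data"):
        for name in files:
            file_path = os.path.join(subdir, name)
            try:
                text = open(file_path, encoding="utf-8").read()
            except UnicodeDecodeError:
                # Fallback for the Non-ISO extended-ASCII files; latin-1
                # never raises, though it may misread a few characters.
                text = open(file_path, encoding="latin-1").read()
            for line in text.splitlines():
                if ":" in line:
                    email, _, password = line.partition(":")
                    writer.writerow([email, password])

Something like mongoimport --type csv --headerline should then be able to load the result into MongoDB.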

Related

How to append a row containing the columns' sums to a pandas dataframe

Here is what my dataframe looks like:
0 M M W B k a D G 247.719248 39.935064 12.983612 177.537373 214.337385 70.248041 78.162404 215.383443
1 n a Y j A N Q m 39.014265 64.053771 13.677425 169.164911 153.225780 31.095511 198.805600 179.653853
2 j z v I n N I X 152.177940 50.524997 79.063318 181.993409 51.367824 19.294708 217.844628 166.896151
3 n w a Y G B y O 243.468930 92.694170 200.305038 249.760627 156.588164 200.031428 146.933709 202.202242
4 R i h L J a q S 122.006004 34.979958 151.963992 116.795194 74.713682 252.979874 34.272430 45.334396
5 m Y n r u t t b 86.097651 229.911157 75.242197 214.069558 246.390175 235.507510 125.431980 90.467756
6 d i u d f Q a q 135.740363 13.388095 107.297373 10.520204 118.578496 101.770257 177.253815 78.800327
7 n F A x H u b y 55.497867 210.402998 191.356683 6.438180 85.967328 64.461602 157.265270 213.673103
8 q h w i S B h i 253.696469 168.964278 31.592088 160.404929 241.434909 232.280512 116.353252 11.540209
9 a z s d Y z l B 50.440346 80.492069 64.991017 88.663195 155.993675 85.967207 120.467390 71.219658
10 A U W m y R k K 156.153985 15.862058 95.013242 48.339397 235.440190 160.565380 236.421396 59.981690
11 z K K w o c n l 56.310181 210.101571 173.887020 181.040997 193.653296 250.875304 81.096499 234.868844
I want to append a row that contains the sum of each column, but the dataframe also contains string columns.
I have tried this solution
df.loc['Total'] = df.select_dtypes(include=['float64', 'int64']).sum(axis=0)
But I am getting sums in the string columns as well, like this:
0 M M W B k a D G 247.719248 39.935064 12.983612 177.537373 214.337385 70.248041 78.162404 215.383443
1 n a Y j A N Q m 39.014265 64.053771 13.677425 169.164911 153.225780 31.095511 198.805600 179.653853
2 j z v I n N I X 152.177940 50.524997 79.063318 181.993409 51.367824 19.294708 217.844628 166.896151
3 n w a Y G B y O 243.468930 92.694170 200.305038 249.760627 156.588164 200.031428 146.933709 202.202242
4 R i h L J a q S 122.006004 34.979958 151.963992 116.795194 74.713682 252.979874 34.272430 45.334396
5 m Y n r u t t b 86.097651 229.911157 75.242197 214.069558 246.390175 235.507510 125.431980 90.467756
6 d i u d f Q a q 135.740363 13.388095 107.297373 10.520204 118.578496 101.770257 177.253815 78.800327
7 n F A x H u b y 55.497867 210.402998 191.356683 6.438180 85.967328 64.461602 157.265270 213.673103
8 q h w i S B h i 253.696469 168.964278 31.592088 160.404929 241.434909 232.280512 116.353252 11.540209
9 a z s d Y z l B 50.440346 80.492069 64.991017 88.663195 155.993675 85.967207 120.467390 71.219658
10 A U W m y R k K 156.153985 15.862058 95.013242 48.339397 235.440190 160.565380 236.421396 59.981690
11 z K K w o c n l 56.310181 210.101571 173.887020 181.040997 193.653296 250.875304 81.096499 234.868844
Total 1598.32 1211.31 1197.37 1604.73 1927.69 1705.08 1690.31 1570.02 1598.323248 1211.310187 1197.373003 1604.727974 1927.690905 1705.077334 1690.308374 1570.021673
Can I keep some placeholder value in the string columns instead? How should it be done?
Any help would be appreciated. I am a newbie to pandas.
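One possible approach (a minimal sketch, not from the original thread, and it assumes df is the frame shown above): sum only the numeric columns, then reindex the result over all columns so the string columns receive a placeholder such as an empty string.

import pandas as pd

# Sketch: total the numeric columns only, then fill the string columns of
# the appended "Total" row with an empty-string placeholder.
totals = df.select_dtypes(include="number").sum()
df.loc["Total"] = totals.reindex(df.columns, fill_value="")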

Sorting a dataframe by another

I have an initial dataframe X:
x y z w
0 1 a b c
1 1 d e f
2 0 g h i
3 0 k l m
4 -1 n o p
5 -1 q r s
6 -1 t v à
with many columns and rows (this is a toy example). After applying some Machine Learning procedures, I get back a similar dataframe, but with the -1s changed to 0s or 1s and the rows sorted in a different way; for example:
x y z w
4 1 n o p
0 1 a b c
6 0 t v à
1 1 d e f
2 0 g h i
5 0 q r s
3 0 k l m
How can I sort the second dataframe into the same row order as the first one? For example, like this:
x y z w
0 1 a b c
1 1 d e f
2 0 g h i
3 0 k l m
4 1 n o p
5 0 q r s
6 0 t v à
If you can't rely on simply sorting the index (e.g. if the first df's index is not sorted, or if you have something other than a RangeIndex), just use loc:
df2.loc[df.index]
x y z w
0 1 a b c
1 1 d e f
2 0 g h i
3 0 k l m
4 1 n o p
5 0 q r s
6 0 t v à
Use:
df.sort_index(inplace=True)
It restores the original order simply by sorting on the index.

Why is my file seemingly being read incorrectly?

In Python I want to read from a large file:
def aggregate(file_input):
    import fileinput
    reviews = []
    with open(file_input.replace(".txt", "_aggregated.txt"), "w") as outp:
        currComp = ""
        outp.write("Business;Stars_In_Sequence")
        for line in fileinput.input(file_input):
            reviews.append(MyReview(line))
            if currComp != reviews[-1].getCompany():
                currComp = reviews[-1].getCompany()
                outp.write("\n" + currComp + ";" + reviews[-1].getStars())
                outp.flush()
            else:
                outp.write(reviews[-1].getStars())
                outp.flush()
The file looks like this:
Business;User;Review_Stars;Date;Length;Votes_Cool;Votes_Funny;Votes_Useful;
0DI8Dt2PJp07XkVvIElIcQ;jkrzTC5P5QGJRoKECzcleQ;5;2014-03-11;421;0;1;0
0DI8Dt2PJp07XkVvIElIcQ;cK78PTjb65kdmRL9BnEdoQ;5;2014-03-29;190;0;1;0
and works fine if I use only a small part of the file, returning the right output:
Business;Stars_In_Sequence
Business;R
0DI8Dt2PJp07XkVvIElIcQ;55555455555555515
LTlCaCGZE14GuaUXUGbamg;555555555
EDqCEAGXVGCH4FJXgqtjqg;3324133
However, if I use the original file it returns this, and I can't figure out why:
Business;Stars_In_Sequence
ÿþB u s i n e s s ;
0 D I 8 D t 2 P J p 0 7 X k V v I E l I c Q ;
L T l C a C G Z E 1 4 G u a U X U G b a m g ;
E D q C E A G X V G C H 4 F J X g q t j q g ;
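The ÿþ at the start is a UTF-16 little-endian byte order mark, so the large file is almost certainly UTF-16 rather than ASCII/UTF-8, and the extra spaces between characters are its NUL bytes leaking through. A minimal sketch of reading it with an explicit encoding (the filename and the per-line handling here are placeholders):

# Sketch only: decode the file as UTF-16 so the BOM and the interleaved
# NUL bytes are handled instead of showing up in the output.
with open("reviews_full.txt", encoding="utf-16") as inp:
    for line in inp:
        print(line.rstrip("\n"))  # or pass each line to MyReview(line)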

Concatenate strings along the off diagonals

Setup
import numpy as np
import pandas as pd
from string import ascii_uppercase
df = pd.DataFrame(np.array(list(ascii_uppercase[:25])).reshape(5, 5))
df
0 1 2 3 4
0 A B C D E
1 F G H I J
2 K L M N O
3 P Q R S T
4 U V W X Y
Question
How do I concatenate the strings along the off diagonals?
Expected Result
0 A
1 FB
2 KGC
3 PLHD
4 UQMIE
5 VRNJ
6 WSO
7 XT
8 Y
dtype: object
What I Tried
df.unstack().groupby(sum).sum()
This works fine: after unstack() each value is indexed by a (column, row) pair, and grouping by the built-in sum adds that pair together, which is constant along each anti-diagonal. But @Zero's answer below is far faster.
You could do
In [1766]: arr = df.values[::-1, :] # or np.flipud(df.values)
In [1767]: N = arr.shape[0]
In [1768]: [''.join(arr.diagonal(i)) for i in range(-N+1, N)]
Out[1768]: ['A', 'FB', 'KGC', 'PLHD', 'UQMIE', 'VRNJ', 'WSO', 'XT', 'Y']
In [1769]: pd.Series([''.join(arr.diagonal(i)) for i in range(-N+1, N)])
Out[1769]:
0 A
1 FB
2 KGC
3 PLHD
4 UQMIE
5 VRNJ
6 WSO
7 XT
8 Y
dtype: object
You may also do arr.diagonal(i).sum() but ''.join is more explicit.

regex pattern won't return in python script

Why does the first snippet below return digits, but the second one does not? I have tried more complicated expressions without success. The expressions I use are valid according to pythex.org, but do not work in the script.
(\d{6}-){7}\d{6} is one such expression. I've tested it against this string: 123138-507716-007469-173316-534644-033330-675057-093280
import re
pattern = re.compile('(\d{1})')
load_file = open('demo.txt', 'r')
search_file = load_file.read()
result = pattern.findall(search_file)
print(result)
==============
import re
pattern = re.compile('(\d{6})')
load_file = open('demo.txt', 'r')
search_file = load_file.read()
result = pattern.findall(search_file)
print(result)
When I put the string into a variable and then search the variable, it works just fine, so the pattern itself should work as is. But that doesn't help when I want to read a text file. I've tried reading the file line by line, and that seems to be where the script breaks down.
import re
pattern = re.compile('((\d{6}-){7})')
#pattern = re.compile('(\d{6})')
#load_file = open('demo.txt', 'r')
#search_file = load_file.read()
test_string = '123138-507716-007469-173316-534644-033330-675057-093280'
result = pattern.findall(test_string)
print(result)
=========
Printout:
Search File:
ÿþB i t L o c k e r D r i v e E n c r y p t i o n R e c o v e r y K e y
T h e r e c o v e r y k e y i s u s e d t o r e c o v e r t h e d a t a o n a B i t L o c k e r p r o t e c t e d d r i v e .
T o v e r i f y t h a t t h i s i s t h e c o r r e c t r e c o v e r y k e y c o m p a r e t h e i d e n t i f i c a t i o n w i t h w h a t i s p r e s e n t e d o n t h e r e c o v e r y s c r e e n .
R e c o v e r y k e y i d e n t i f i c a t i o n : f f s d f a - f s d f - s f
F u l l r e c o v e r y k e y i d e n t i f i c a t i o n : 8 8 8 8 8 8 8 8 - 8 8 8 8 - 8 8 8 8 - 8 8 8 8 - 8 8 8 8 8 8 8 8 8 8 8
B i t L o c k e r R e c o v e r y K e y :
1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1
6 6 6 6 6 6
Search Results:
[]
Process finished with exit code 0
================
This is where I ended up. It finds the string just fine and without the commas.
import re
pattern = re.compile(r'(\w{6}-\w{6}-\w{6}-\w{6}-\w{6}-\w{6}-\w{6}-\w{6})')
load_file = open('demo3.txt', 'r')
for line in load_file:
    print(pattern.findall(line))
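Worth noting, as an observation rather than something from the thread: the ÿþ and the spaces between every character in the printout are the signature of a UTF-16 file being read with the wrong codec, which is why \d{6} never finds six consecutive digits. A sketch of reading the file with an explicit encoding so the original digit-based pattern matches (the filename is assumed):

import re

# Sketch only: decode the BitLocker text as UTF-16 so the digits are
# contiguous, then the six-digit groups match directly.
pattern = re.compile(r'(?:\d{6}-){7}\d{6}')
with open('demo.txt', encoding='utf-16') as f:
    print(pattern.findall(f.read()))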
