string manipulation, data wrangling, regex - python

I have a .txt file of 3 million rows. The file contains data that looks like this:
# RSYNC: 0 1 1 0 512 0
#$SOA 5m localhost. hostmaster.localhost. 1906022338 1h 10m 5d 1s
# random_number_ofspaces_before_this text $TTL 60s
#more random information
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com
.0-1-hub.com
.zzzy1129.cn
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
.005verf-desj.com
.01accesfunds.com
In the above data, there is a code associated with all domains listed beneath it.
I want to turn the above data into a format that can be loaded into a HiveQL/SQL table. The HiveQL table should look like:
+--------------------+--------------+-------------+-----------------------------------------------------+
| domain_name | period_count | parsed_code | raw_code |
+--------------------+--------------+-------------+-----------------------------------------------------+
| test | 0 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-0m5tk.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-1-hub.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .zzzy1129.cn | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-il.ml | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .005verf-desj.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .01accesfunds.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
+--------------------+--------------+-------------+-----------------------------------------------------+
Please note that I do not want the vertical bars in any output. They are only there to make the above look like a table.
I'm guessing that creating a HiveQL table like the above will involve converting the .txt into a .csv or a Pandas data frame. If creating a .csv, then the .csv would probably look like:
domain_name,period_count,parsed_code,raw_code
test,0,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-1-hub.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.zzzy1129.cn,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-il.ml,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.005verf-desj.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.01accesfunds.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
I'd be interested in a Python solution, but lack familiarity with the packages and functions necessary to complete the above data wrangling steps. I'm looking for a complete solution, or code tidbits to construct my own solution. I'm guessing regular expressions will be needed to identify the "category" or "code" line in the raw data. They always start with ":127.0.1." I'd also like to parse the code out to create a parsed_code column, and a period_count column that counts the number of periods in the domain_name string. For testing purposes, please create a .txt of the sample data I have provided at the beginning of this post.

Regardless of how you want to format the result in the end, I suppose the first step is to separate the domain_name and code. That part is pure Python:
rows = []
code = None
parsed_code = None
with open('input.txt', 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith(':127'):
            # A code line: remember it for the domains that follow
            code = line
            parsed_code = line.split(':')[1]
            continue
        if line.startswith('#'):
            # Skip comment lines
            continue
        period_count = line.count('.')
        rows.append((line, period_count, parsed_code, code))
Just for illustration, you can use pandas to format the data nicely as a table, which might help if you want to pipe this to SQL, but it's not strictly necessary. Post-processing of strings is also quite straightforward in pandas.
import pandas as pd
df = pd.DataFrame(rows, columns=['domain_name', 'period_count', 'parsed_code', 'raw_code'])
print (df)
prints this:
         domain_name  period_count  parsed_code                                           raw_code
0               test             0    127.0.1.2  :127.0.1.2:https://www.spamhaus.org/query/doma...
1       .0-0m5tk.com             2    127.0.1.2  :127.0.1.2:https://www.spamhaus.org/query/doma...
2       .0-1-hub.com             2    127.0.1.2  :127.0.1.2:https://www.spamhaus.org/query/doma...
3       .zzzy1129.cn             2    127.0.1.2  :127.0.1.2:https://www.spamhaus.org/query/doma...
4           .0-il.ml             2    127.0.1.4  :127.0.1.4:https://www.spamhaus.org/query/doma...
5  .005verf-desj.com             2    127.0.1.4  :127.0.1.4:https://www.spamhaus.org/query/doma...
6  .01accesfunds.com             2    127.0.1.4  :127.0.1.4:https://www.spamhaus.org/query/doma...
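Since the question asks for a .csv, the same loop can also write its rows straight to a file with the standard csv module, without holding 3 million tuples in memory. The following is only a sketch under assumptions: the names parse_blocklist and write_csv are made up for illustration, blank lines are skipped, and any line starting with ":127.0.1." is treated as a code line.

```python
import csv
import io


def parse_blocklist(lines):
    """Yield (domain_name, period_count, parsed_code, raw_code) tuples."""
    raw_code = parsed_code = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith(":127.0.1."):
            raw_code = line
            parsed_code = line.split(":")[1]  # text between the first two colons
            continue
        yield line, line.count("."), parsed_code, raw_code


def write_csv(lines, out):
    """Write the parsed rows, preceded by a header row, to an open text file."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["domain_name", "period_count", "parsed_code", "raw_code"])
    writer.writerows(parse_blocklist(lines))


# Small in-memory sample taken from the question's data
sample = """\
# RSYNC: 0 1 1 0 512 0
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
.0-0m5tk.com
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
"""
buffer = io.StringIO()
write_csv(sample.splitlines(), buffer)
print(buffer.getvalue())
```

For the real file, an open file handle can be passed in place of sample.splitlines(), and an output file opened with newline="" in place of the StringIO buffer.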

You can do all of this with the Python standard library.
HEADER = "domain_name | code"
# Open files
with open("input.txt") as f_in, open("output.txt", "w") as f_out:
    # Write header
    print(HEADER, file=f_out)
    print("-" * len(HEADER), file=f_out)
    # Parse file and output in correct format
    code = None
    for line in f_in:
        # Drop the trailing newline so that endswith("$") can match
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Ignore comments
            continue
        if line.endswith("$"):
            # Store line as the current "code"
            code = line
        else:
            # Write these domain_name entries into the
            # output file separated by ' | '
            print(line, code, sep=" | ", file=f_out)

Related

Grouping CSV Rows By The Names of Users

I have a table on Python with the following data from a CSV:
| subscriberKey | Name | Job
| -------- | -------------- |---------|
| 123#yahoo.com | Brian | Computer Tech|
| example#gmail.com | Brian | Sales|
| someone#google.com |Gabby |Sales|
| testinge#sendesk.com |Gabby |Marketing|
| sandbox#aol.com |Tyler | Porter |
I want to be able to group the data by the Name and have all of the other cells come with it.
It should end up looking like this.
| subscriberKey | Name | Job
| -------- | -------------- |---------|
| 123#yahoo.com | Brian | Computer Tech|
| example#gmail.com | Brian | Sales|
| subscriberKey | Name | Job
| -------- | -------------- |---------|
| someone#google.com |Gabby |Sales|
| testinge#sendesk.com |Gabby |Marketing|
| subscriberKey | Name | Job
| -------- | -------------- |---------|
| sandbox#aol.com |Tyler | Porter |
Furthermore, I want to create a new CSV file for every table that is created. I have tried to do this in a loop but have failed too many times. I am currently back at the starting point and only have the file printing as its normal table. Can anyone help?
import csv
f = open('work.csv')
csv_f = csv.reader(f)
for row in csv_f:
    print(row)
When you are trying to group variables based on a certain key (the name in this case) a hashmap is usually a good data structure to try.
As a general solution for future readers:
Create an empty dictionary.
Choose the key that you want to group your data.
Iterate over the data and parse the key and related items.
Add the related items to dict[key].
Now each key in dict will have a list of all the items related to it.
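The steps above can be sketched in a few lines. This is a generic illustration, not code from the question: the helper name group_by and the in-memory records list are assumptions, with the rows copied from the question's sample table.

```python
from collections import defaultdict


def group_by(records, key_index):
    """Group records (tuples or lists) by the value at position key_index."""
    groups = defaultdict(list)  # missing keys start out as an empty list
    for record in records:
        groups[record[key_index]].append(record)
    return dict(groups)


# Rows from the question's sample table: (subscriberKey, Name, Job)
records = [
    ("123#yahoo.com", "Brian", "Computer Tech"),
    ("example#gmail.com", "Brian", "Sales"),
    ("someone#google.com", "Gabby", "Sales"),
    ("testinge#sendesk.com", "Gabby", "Marketing"),
    ("sandbox#aol.com", "Tyler", "Porter"),
]

grouped = group_by(records, key_index=1)
for name, items in grouped.items():
    print(name, items)
```

Using defaultdict(list) avoids having to check whether a key already exists before appending to it.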
Tailored more specifically to the OP's question:
import collections

def write_csv(name, lines):
    with open(f"{name}_work.csv", "w") as f:
        for line in lines:
            f.write(','.join(line))
            f.write('\n')

if __name__ == "__main__":
    # LOAD DATA
    with open("work.csv", 'r') as f:
        lines = []
        for line in f.readlines():
            lines.append(line.strip('\n').split(','))
    # GROUP DATA BY NAME INTO A DICTIONARY
    names = collections.defaultdict(list)
    for email, name, job in lines[1:]:
        names[name].append((email, job))
    # WRITE A NEW .csv FILE FOR EACH NAME
    for name in names:
        new_lines = lines[:1]
        for email, job in names[name]:
            # Keep the header's column order: subscriberKey, Name, Job
            new_lines.append([email, name, job])
        write_csv(name, new_lines)

problem with line terminator \n on dataframe and .csv

I'm using a Python API to get a .csv file from an email attachment that I received in Gmail, turning it into a DataFrame to do some data prep, and saving it as a .csv on my PC. It works well; the problem is that I get '\n' in some columns (it came like that from the source attachment).
This is the code I use to get the data and turn it into a DataFrame and a .csv:
r = io.BytesIO(part.get_payload(decode = True))
df = pd.DataFrame(r)
df.to_csv('C:/Users/x.csv', index = False)
Example of the df that I get:
+-------------+----------+---------+----------------------+
| Information | Modified | Created | MD_x0020_Agenda\r\n' |
+-------------+----------+---------+----------------------+
| c | d | f | \r\n' |
| b\n' | | | |
| c | e | \r\n' | |
+-------------+----------+---------+----------------------+
Example of the correct result:
+-------------+----------+---------+----------------------+
| Information | Modified | Created | MD_x0020_Agenda\r\n' |
+-------------+----------+---------+----------------------+
| c | d | f | \r\n' |
| b | c | e | \r\n' |
+-------------+----------+---------+----------------------+
I tried to use line_terminator. My thinking was that if I forced it to treat only \r\n, and not \n, as the row terminator, it would work. It didn't.
df.to_csv('C:/Users/x.csv', index = False, line_terminator='\r\n')
Can somebody help me with this? It's really frustrating, because I can't make progress on my project until it's solved. Thanks.
A "\n" inside a cell usually marks a line break in the text, i.e. the Return key.
You can get rid of it by calling replace on your DataFrame. Pass regex=True so that the newline is removed wherever it occurs inside a string; without it, only cells whose entire value is '\n' are replaced:
df = df.replace('\n', '', regex=True)
For more details on the function, see the pandas documentation for DataFrame.replace.
Hope it works.
I combined the two answers and got the solution, thanks!
PS: With some research I found that this is a Windows/Excel issue: when you export a .csv, it treats \n and \r\n (\r too?) as a new row. DataFrame treats only \r\n as a new row (by default).
import io
import pandas as pd
df = pd.read_csv(io.BytesIO(part.get_payload(decode = True)), header=None)
#grab the first row for the header
new_header = df.iloc[0]
#take the data less the header row
df = df[1:]
#set the header row as the df header
df.columns = new_header
#replace the \n which is creating new lines
df['Information'] = df['Information'].replace(regex = '\n', value = '')
df.to_csv('C:/Users/x.csv', index = False)
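If pandas is not required for the cleanup itself, the same idea can be sketched with only the standard library. This is an alternative technique, not the approach used above, and it rests on an assumption: the stray line breaks sit inside quoted fields, so the csv module can tell them apart from real row terminators. The function name clean_csv and the sample payload are made up for illustration.

```python
import csv
import io


def clean_csv(raw_bytes, encoding="utf-8"):
    """Parse CSV bytes and strip line-break characters from every cell."""
    text = raw_bytes.decode(encoding)
    # newline="" passes line endings through untouched, so the csv module
    # can distinguish quoted line breaks from real row terminators.
    reader = csv.reader(io.StringIO(text, newline=""))
    return [
        [cell.replace("\r", "").replace("\n", "") for cell in row]
        for row in reader
    ]


# Hypothetical payload: the third row has a quoted cell ending in "\n"
payload = b'Information,Modified,Created\r\nc,d,f\r\n"b\n",c,e\r\n'
rows = clean_csv(payload)
for row in rows:
    print(row)
```

If the line breaks in the attachment are not quoted, a parser cannot distinguish them from row terminators, and the rows have to be re-joined by some other rule, such as the expected number of columns.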

Taking last line from each row in a large csv file?

I have 12,000 rows, each containing multiple lines.
I need to read only the last line of each of the 12,000 rows and write it into a new column.
"► Контакт с пациентом | 07.02.2019 | |
► Принять в работу | 07.02.2019 | |
► Контакт с пациентом | 08.02.2019 | |
► Получить КП | 14.02.2019 | |
► ждем КП | 18.02.2019 | |
► отправил ему ответ и стоимости лекарств! через дви недели с ним связываться | 05.03.2019 | |
► арихив | 23.03.2019 | | ";
"► Контакт с пациентом | 19.06.2019 | |
► Принять в работу | 19.06.2019 | |
► Контакт с пациентом | 26.08.2019 | |
► Архив. | 10.09.2019 | | ";
I can do that for only one row. How can I do it for all 12,000 rows?
import pandas as pd
df = pd.read_csv('/Users/gfidarov/Desktop/crosscheck/crosscheck/sheet1')
r = df.split('|')
r = r[-4:]
r = '|'.join(r)
print(r)
Here I can read the file with the csv library, but I can't take only the last line. If I try what I did with pandas, row = row[-4:], I get an error. How can I solve my problem?
import csv
with open('/Users/gfidarov/Desktop/sheet_one') as f:
    reader = csv.DictReader(f, delimiter='|')
    for row in reader:
        print(list(row))
For that file, the last line of each row is the line ending with a semicolon (;) following a double quote (").
So this could be enough:
with open('/Users/gfidarov/Desktop/sheet_one') as f:
    for line in f:
        if line.strip().endswith('";'):  # Ok this is the line we want...
            line = line.strip().strip('";')  # clean it a little
            print(line)
BTW, the csv attempt did not work because, by default, the double quote is used to quote fields containing the delimiter or newlines, so here the csv module sees each quoted block as one single field.
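That quoting behaviour can also be turned to advantage. As a hedged sketch (assuming, as in the sample, that every record is one double-quoted block followed by a semicolon), csv.reader with delimiter=';' returns each multi-line block as a single field, and the last line can then be taken with splitlines(). The function name last_lines and the shortened English sample are illustrative only.

```python
import csv
import io


def last_lines(stream):
    """Yield the last line of the quoted block in every record."""
    reader = csv.reader(stream, delimiter=";", quotechar='"')
    for record in reader:
        if not record or not record[0].strip():
            continue  # skip empty records
        # The whole multi-line block arrives as one field
        yield record[0].strip().splitlines()[-1].strip()


# Shortened stand-in for the file: two records, each a quoted block plus ';'
sample = io.StringIO(
    '"first step | 07.02.2019 | |\n'
    'archive | 23.03.2019 | | ";\n'
    '"first step | 19.06.2019 | |\n'
    'closed | 10.09.2019 | | ";\n'
)
result = list(last_lines(sample))
print(result)
```

With a real file, open it with newline="" and pass the file object instead of the StringIO sample.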
row in DictReader is a dict whose keys are taken from the first row.
When you use list(row), that only gives you those keys.
You want to use csv.reader instead of csv.DictReader, which gives you a list for each row.
import csv
with open('/Users/gfidarov/Desktop/sheet_one.csv') as f:
    reader = csv.reader(f, delimiter='|')
    for row in reader:
        print(row)
Also, as @BergeBallesta said, the double quotes cause the error, so you need to use a text editor to find and replace the "s and the ;s so that the csv module can read the file properly.

I want to display variables in table format that should be perfectly align in python [duplicate]

I want to make a table in python
+----------------------------------+--------------------------+
| name | rank |
+----------------------------------+--------------------------+
| {} | [] |
+----------------------------------+--------------------------+
| {} | [] |
+----------------------------------+--------------------------+
The problem is that I first want to load a text file containing domain names, then make a GET request to each domain one by one, and then print the website name and status code in a table. The table should be perfectly aligned. I have written some code, but I failed to display the output in an aligned table like the one above.
Here is my code
f = open('sub.txt', 'r')
for i in f:
    try:
        x = requests.get('http://'+i)
        code = str(x.status_code)
        #Now here I want to display `code` and `i` variables in table format
    except:
        pass
In above code I want to display code and i variables in table format as I showed in above table.
Thank you
You can achieve this using the center() method of string. It creates and returns a new string that is padded with the specified character.
Example,
f = ['AAA','BBBBB','CCCCCC']
codes = [401,402,105]
col_width = 40
print("+"+"-"*col_width+"+"+"-"*col_width+"+")
print("|"+"Name".center(col_width)+"|"+"Rank".center(col_width)+"|")
print("+"+"-"*col_width+"+"+"-"*col_width+"+")
for i in range(len(f)):
    _f = f[i]
    code = str(codes[i])
    # Name goes in the first column and code in the second, matching the header
    print("|"+_f.center(col_width)+"|"+code.center(col_width)+"|")
    print("+"+"-"*col_width+"+"+"-"*col_width+"+")
Output
+----------------------------------------+----------------------------------------+
|                  Name                  |                  Rank                  |
+----------------------------------------+----------------------------------------+
|                  AAA                   |                  401                   |
+----------------------------------------+----------------------------------------+
|                 BBBBB                  |                  402                   |
+----------------------------------------+----------------------------------------+
|                 CCCCCC                 |                  105                   |
+----------------------------------------+----------------------------------------+
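A fixed col_width works when the longest value is known in advance. As a possible variation, the widths can be derived from the data so that long domain names never break the alignment. This is only a sketch; render_table and the example rows are hypothetical names and values, not part of the original answer.

```python
def render_table(headers, rows):
    """Return an ASCII table whose column widths fit the longest cell."""
    cells = [[str(value) for value in row] for row in rows]
    # Width of each column: longest header or cell, plus one space each side
    widths = [
        max([len(header)] + [len(row[i]) for row in cells]) + 2
        for i, header in enumerate(headers)
    ]
    border = "+" + "+".join("-" * width for width in widths) + "+"

    def fmt(values):
        return "|" + "|".join(v.center(w) for v, w in zip(values, widths)) + "|"

    lines = [border, fmt(headers), border]
    for row in cells:
        lines.append(fmt(row))
        lines.append(border)
    return "\n".join(lines)


table = render_table(
    ["name", "rank"],
    [("example.com", 200), ("example.org", 404)],
)
print(table)
```

In the question's loop, each (domain, status code) pair could be appended to a list and the table rendered once after all requests have finished, since the widths depend on every row.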

How do I save the header and units of an astropy Table into an ascii file

I'm trying to create an ascii table with some information on the header, the names and units of the columns and some data, it should look like this:
# ... Header Info ...
Name | Morphology | ra_u | dec_u | ...
| InNS+B+MOI | HH:MM:SS.SSS | ±DD:MM:SS:SSS| ...
==============| ========== | ============ | ============ | ...
1_Cam_A | I | 04:32:01.845 | +53:54:39.03 ...
10_Lac | I | 22:39:15.679 | +39:03:01.01 ...
...
So far I've tried numpy.savetxt and astropy's ascii.write. numpy won't really solve my problem, and with ascii.write I've been able to get something similar but not quite right:
Name | Morphology | ra_u | dec_u | ...
================== | ========== | ============ | ============ | ...
1_Cam_A | I | 04:32:01.845 | +53:54:39.03 ...
...
I'm using this code:
formato= {'Name':'%-23s','Morphology':'%-10s','ra_u':'%s','dec_u':'%s',...}
names=['Name','Morphology','ra_u','dec_u','Mag6']
units=['','InNS+B+MOI','HH:MM:SS.SSS','±DD:MM:SS:SSS',...]
ascii.write(data, output='pb.txt',format='fixed_width_two_line',position_char='=',delimiter=' | ',names=names, formats=formato)
If I print the table in my terminal, it looks as it should except for the header info, but when I save it to a file the units disappear.
Is there any way to include them in the file, or do I need to save the file and edit it later?
P.S.: I also tried some other formats for ascii.write, such as IPAC. In that case the problem is that it includes a 4th row in the header like '| null | null |.....' and I don't know how to get rid of it.
Thanks for the help.
Regards.
There doesn't appear to be a straightforward way to write out the units of a column in a generic way using astropy.table or astropy.io.ascii. You may want to raise an issue at https://github.com/astropy/astropy/issues with a feature request.
However, there is a pretty simple workaround using the format ascii.ipac:
tbl.write('test.txt', format='ascii.ipac')
with open('test.txt', 'r') as fh:
    output = []
    for ii, line in enumerate(fh):
        if ii not in (1,3):
            output.append(line)
with open('test.txt', 'w') as fh:
    fh.writelines(output)
which will write out in the IPAC format, then remove the 2nd and 4th lines.
Unless your table absolutely has to be in that format, if you want an ASCII table with more complex metadata for the columns, please consider using the ECSV format.
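If the exact layout from the question is required, another option is to write the file by hand with plain Python. The sketch below does not use any astropy API: write_table, its parameters, and the sample values are all hypothetical, and the rows are assumed to have already been extracted from the table as plain values.

```python
import os
import tempfile


def write_table(path, header_info, names, units, rows):
    """Write comment lines, a names row, a units row, a '=' rule, then data."""
    body = [[str(value) for value in row] for row in rows]
    every_row = [list(names), list(units)] + body
    # Each column is as wide as its longest entry (name, unit or value)
    widths = [max(len(r[i]) for r in every_row) for i in range(len(names))]

    def fmt(values):
        return " | ".join(v.ljust(w) for v, w in zip(values, widths)).rstrip()

    with open(path, "w", encoding="utf-8") as fh:
        for info in header_info:
            fh.write(f"# {info}\n")
        fh.write(fmt(names) + "\n")
        fh.write(fmt(units) + "\n")
        fh.write(" | ".join("=" * w for w in widths) + "\n")
        for row in body:
            fh.write(fmt(row) + "\n")


# Example values taken from the question; the path is a temporary file
path = os.path.join(tempfile.mkdtemp(), "pb.txt")
write_table(
    path,
    header_info=["Header Info"],
    names=["Name", "Morphology", "ra_u"],
    units=["", "InNS+B+MOI", "HH:MM:SS.SSS"],
    rows=[("1_Cam_A", "I", "04:32:01.845"), ("10_Lac", "I", "22:39:15.679")],
)
with open(path, encoding="utf-8") as fh:
    written = fh.read().splitlines()
print("\n".join(written))
```

A file written this way is not guaranteed to be readable by astropy's built-in readers, so this only makes sense when the human-readable layout matters more than round-tripping.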
