import series-like data file into pandas - python

Here is an example of the data file:
=====
name aaa
place paaa
date Thu Oct 1 12:02:03 2015
load_status 198
add_name naaa
[---blank line---]
=====
name bbb
place pbbb
date Thu Oct 3 21:20:36 2015
load_status 2000.327
add_name nbbb
[---blank line---]
In one file there might be hundreds of records like that.
I would like to get a pandas object looking like this:
name | place | date | load_status | add_name
---------------------------------------------------------------
aaa | paaa | Thu Oct 1 12:02:03 2015 | 198 | naaa
bbb | pbbb | Thu Oct 3 21:20:36 2015 | 2000.327 | nbbb
The number of fields in each record is the same: all records have a 'name', a 'place', etc.
I can transpose the file with bash+grep+awk and then read it as CSV, but that's not practical for users who have only Python and Windows.
Transposing the file with Python and then reading it as CSV looks like overkill, as I expect pandas should be able to handle this case somehow.
I thought of Series+dtypes and read_table - but couldn't make them work for me.

Here's a simple loop in Python. You'll have to do some cleaning and checking afterwards, but this should get you started.
import pandas as pd

records = []
this_record = {}
with open(input_fn, 'r') as f:        # input_fn is the path to the data file
    for line in f:
        if line.strip() == '':
            if this_record:           # a blank line closes the current record
                records.append(this_record)
                this_record = {}
            continue
        elif line.startswith('='):    # skip the ===== separators
            continue
        parts = line.split()
        this_record[parts[0]] = ' '.join(parts[1:]).strip()
if this_record:                       # catch a final record with no trailing blank line
    records.append(this_record)
df = pd.DataFrame.from_records(records)
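Since the question mentions read_table and Series, here is a hedged alternative sketch that does the reshaping in pandas itself: split each line into a key/value pair, then pivot the pairs into columns. It assumes every record starts with a 'name' line, as in the sample, and that each key appears once per record.
import pandas as pd

with open(input_fn) as f:                       # same input_fn as above
    pairs = [line.split(None, 1) for line in f
             if line.strip() and not line.startswith('=')]

df = pd.DataFrame(pairs, columns=['key', 'value'])
df['value'] = df['value'].str.strip()           # drop the trailing newline
df['record'] = (df['key'] == 'name').cumsum()   # a 'name' line opens each record
df = df.pivot(index='record', columns='key', values='value')
print(df[['name', 'place', 'date', 'load_status', 'add_name']])
The pivot raises if a (record, key) pair is duplicated, which doubles as a cheap consistency check on the input.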

Related

extracting data from columns in a text file using python

I am new to Python file data processing. I have the following text file containing the report of a new college campus. I want to extract the data from the "colleges" column, and the "book IDs_1" value for block_ABC_top, which is 23. I also want to know if there are any more occurrences of block_ABC_top in the colleges column, and find the value of the book IDs_1 column for each.
Is it possible with a text file, or will I have to change it to CSV? How do I write code for this data processing? Kindly help me!
Copyright 1986-2019, Inc. All Rights Reserved.
Design Information
-----------------------------------------------------------------------------------------------------------------
| Version : (lin64) Build 2729669 Thu Dec 5 04:48:12 MST 2019
| Date : Wed Aug 26 00:46:08 2020
| Host : running 64-bit Red Hat Enterprise Linux Server release 7.8
| Command : college report
| Design : college
| Device : laptop
| Design State : in construction
-----------------------------------------------------------------------------------------------------------------
Table of Contents
-----------------
1. Information by Hierarchy
1. Information by Hierarchy
---------------------------
+----------------------------------------------+--------------------------------------------+------------+------------+---------+------+-----+
| colleges | Module | Total mems | book IDs_1 | canteen | BUS | UPS |
+----------------------------------------------+--------------------------------------------+------------+------------+---------+------+-----+
| block_ABC_top | (top) | 44 | 23 | 8 | 8 | 8 |
| (block_ABC_top_0) | block_ABC_top_0 | 5 | 5 | 5 | 2 | 9 |
+----------------------------------------------+--------------------------------------------+------------+------------+---------+------+-----+
I have a data list which holds the colleges, such as block_ABC_top, block_ABC_top_1, block_ABC_top, block_ABC_top_1... Here is my code below.
The problem I face is that it only takes the data for data[0], but data[0] and data[2] hold the same college, and I expect the check to happen twice.
with open ("utility.txt", 'r') as f1:
for line in f1:
if data[x] in line:
line_values = line.split('|')
if (int(line_values[4]) == 23 or int(line_values[7]) == 8):
filecheck = fullpath + "/" + filenames[x]
print filecheck
#print "check file "+ filenames[x]
x = x + 1
f1.close()
print [x.split()[1] for x in open(file).readlines() if x.startswith('|')] #colleges column
print [x.split()[7] for x in open(file).readlines() if x.startswith('|')] #book_IDs_1 column
Try running these.
Instead of going by the exact position of each field, a better way would be to use the split() function, since you have your fields separated by a | symbol. You can loop through the lines of the file and handle them accordingly.
with open('report.txt') as f:
    for line in f:
        line_values = line.split("|")
        print(line_values[1].strip())  # block_ABC_top
To extract the book IDs_1 column data, use the code below:
with open('report.txt') as f:
    for line in f:
        if 'block_ABC_top' in line:
            line_values = line.split('|')
            print(line_values[4])  # prints 23 and 5
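Putting the two ideas together, here is a hedged sketch that collects every occurrence of a college and its book IDs_1 value, rather than stopping at the first match; the function name book_ids_for and the exact-match test are my own additions, assuming the '|'-separated layout shown above.
def book_ids_for(college, path='utility.txt'):
    hits = []
    with open(path) as f:
        for line in f:
            if not line.lstrip().startswith('|'):
                continue                      # skip the +---+ rules and prose
            fields = [v.strip() for v in line.split('|')]
            if fields[1] == college:          # exact match on the colleges column
                hits.append(int(fields[4]))   # the book IDs_1 column
    return hits

print(book_ids_for('block_ABC_top'))  # [23] for the sample table
The exact match on fields[1] avoids the substring problem where 'block_ABC_top' also matches '(block_ABC_top_0)'.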

Pandas not displaying all columns when writing to CSV

I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However, the actual CSV file I am getting looks like this:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
Using this code (runnable by calling createCSV(); it pulls data from the COVID-19 government GitHub):
import csv          # csv reader
import pandas as pd # csv parser
import collections  # not needed
import requests     # retrieves URL from gov data

def getFile():
    url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
    response = requests.get(url)
    print('Writing file...')
    open('us_deaths.csv', 'wb').write(response.content)

# takes raw data from link. creates CSV for each unique state and removes unneeded headings
def createCSV():
    getFile()
    # init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    # drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    #data.drop(['Admin2'], axis=1, inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    data.to_csv('DEBUGDATA2.csv')
    # sets province_state as primary key. Searches based on date and key to create new CSVs in root directory of python app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to set the date columns (everything after the first two) to a date, so that I can select only from 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know.
To ensure it works, it prints out all the field names, including Admin2 (county name), Province_State, and the rest of the dates.
However, in my CSV as you can see, Admin2 seems to have disappeared. I am not sure how to make this work, if anyone has any ideas that'd be great!
I changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
I needed to create a multi-key index (a MultiIndex) so the Admin2 column shows up. Any smoother tips on the date-range section are welcome.
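Since the date-range step was left open, here is a hedged sketch of one alternative: parse the column labels once, build a boolean mask, and slice with it. It assumes the cleaned frame from createCSV() (index already set to ['Province_State', 'Admin2'], all remaining columns date strings).
import pandas as pd

date_cols = pd.to_datetime(data.columns, errors='coerce')  # non-dates become NaT
keep = date_cols >= pd.Timestamp('2020-03-23')             # NaT compares as False
recent = data.loc[:, keep]
for name, g in recent.groupby(level='Province_State'):
    g.to_csv('{0}_confirmed_deaths.csv'.format(name))
This avoids regenerating a pd.date_range that has to match the column labels exactly.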
Thanks for the help all!

Split a text file into multiple files and load into pandas data frames

I have data in a very ugly format:
Table 501
----------------------------------------------------------------
|Sale|Di|Dv|Cus |Mat |Valid From|Valid to |
----------------------------------------------------------------
|88|01|02|dd|20300 |24.05.2012|31.12.9999|
|889|01|02|dd|20300 |24.05.2012|31.12.9999|
|890|01|02|dd|20300 |24.05.2012|31.12.9999|
----------------------------------------------------------------
Table 55
---------------------------------------------------------
|Sale|Di|Dv|Cus |Grou|S|Valid From|Valid to |
---------------------------------------------------------
|4500|44|55|A|01560 | |11.02.2019|31.12.9999|
|4500|44|55|BBB|55070 | |30.04.2018|31.12.9999|
|4500|44|55|D|55080 | |30.04.2018|31.12.9999|
|4500|44|55|D|55420 | |30.04.2018|31.12.9999|
|4500|44|55|8834496 |55450 | |30.04.2018|31.12.9999|
---------------------------------------------------------
Table 065
----------------------------------------------------------------
|Sale|Di|Dv|Cus |Mat |Valid From|Valid to |
----------------------------------------------------------------
|4500|44|55|bbbb |01000 |29.05.2013|31.12.9999|
----------------------------------------------------------------
I want to use Python to extract the data from this txt file into pandas DataFrames, named after the tables, e.g. Table_065.
I thought I would read the whole txt, split it into multiple txts, drop the dashed separator lines, and then load each piece as a single table.
But I got stuck pretty soon:
file = open('0400.txt', 'r')
a = [n for n in file.readlines() if not n.startswith(' -') ]
#a = str(a)
#b = [n for n in a.readlines() if not n.startswith(' ') ]
It seems that after the list comprehension, the variable a is no longer a string but a list, etc.
Simply put, I need help.
Please, is there anyone here who can help me?
Thanks!
Try a bit of manipulation, with pandas error handling when converting the year 9999 to a datetime object:
import pandas as pd

with open("0400.txt", "r") as f:
    lines = [
        [y.strip() for y in x.split("|")]
        for x in f.readlines() if not x.startswith(" -")]
df = pd.DataFrame(lines[1:], columns=lines[0])
df["Valid to"] = pd.to_datetime(df["Valid to"], errors="coerce").fillna(pd.Timestamp.max.date())
df["Valid From"] = pd.to_datetime(df["Valid From"], errors="coerce")
print(df)
Sale Di Dv Cus Mat Valid From Valid to
0 0400 01 02 1327260 20300 2012-05-24 2262-04-11
1 0400 01 02 1327260 20300 2012-05-24 2262-04-11
2 0400 01 02 1327260 20300 2012-05-24 2262-04-11
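Going one step further toward the asked-for split, here is a hedged sketch that builds one DataFrame per table, keyed by the "Table NNN" header lines; the flush helper is my own addition, assuming the layout shown in the question.
import pandas as pd

def flush(name, rows, tables):
    # Turn the accumulated rows into a DataFrame; the first row is the header.
    if name and rows:
        tables['Table_' + name] = pd.DataFrame(rows[1:], columns=rows[0])

tables = {}
name, rows = None, []
with open('0400.txt') as f:
    for line in f:
        line = line.strip()
        if line.startswith('Table'):
            flush(name, rows, tables)         # close the previous table
            name, rows = line.split()[1], []
        elif line.startswith('|'):
            rows.append([c.strip() for c in line.strip('|').split('|')])
flush(name, rows, tables)                     # don't forget the last table

print(tables['Table_065'])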

string manipulation, data wrangling, regex

I have a .txt file of 3 million rows. The file contains data that looks like this:
# RSYNC: 0 1 1 0 512 0
#$SOA 5m localhost. hostmaster.localhost. 1906022338 1h 10m 5d 1s
# random_number_ofspaces_before_this text $TTL 60s
#more random information
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com
.0-1-hub.com
.zzzy1129.cn
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
.005verf-desj.com
.01accesfunds.com
In the above data, there is a code associated with all domains listed beneath it.
I want to turn the above data into a format that can be loaded into HiveQL/SQL. The HiveQL table should look like:
+--------------------+--------------+-------------+-----------------------------------------------------+
| domain_name | period_count | parsed_code | raw_code |
+--------------------+--------------+-------------+-----------------------------------------------------+
| test | 0 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-0m5tk.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-1-hub.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .zzzy1129.cn | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-il.ml | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .005verf-desj.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .01accesfunds.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
+--------------------+--------------+-------------+-----------------------------------------------------+
Please note that I do not want the vertical bars in any output; they are just to make the above look like a table.
I'm guessing that creating a HiveQL table like the above will involve converting the .txt into a .csv or a Pandas data frame. If creating a .csv, then the .csv would probably look like:
domain_name,period_count,parsed_code,raw_code
test,0,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-1-hub.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.zzzy1129.cn,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-il.ml,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.005verf-desj.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.01accesfunds.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
I'd be interested in a Python solution, but lack familiarity with the packages and functions necessary to complete the above data wrangling steps. I'm looking for a complete solution, or code tidbits to construct my own solution. I'm guessing regular expressions will be needed to identify the "category" or "code" line in the raw data. They always start with ":127.0.1." I'd also like to parse the code out to create a parsed_code column, and a period_count column that counts the number of periods in the domain_name string. For testing purposes, please create a .txt of the sample data I have provided at the beginning of this post.
Regardless of how you want to format it in the end, I suppose the first step is to separate the domain_name and code. That part is pure Python:
rows = []
code = None
parsed_code = None
with open('input.txt', 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith(':127'):
            code = line
            parsed_code = line.split(':')[1]
            continue
        if line.startswith('#'):
            continue
        period_count = line.count('.')
        rows.append((line, period_count, parsed_code, code))
Just for illustration, you can use pandas to format the data nicely as tables, which might help if you want to pipe this to SQL, but it's not absolutely necessary. Post-processing of strings is also quite straightforward in pandas.
import pandas as pd
df = pd.DataFrame(rows, columns=['domain_name', 'period_count', 'parsed_code', 'raw_code'])
print (df)
prints this:
domain_name period_count parsed_code raw_code
0 test 0 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
1 .0-0m5tk.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
2 .0-1-hub.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
3 .zzzy1129.cn 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
4 .0-il.ml 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
5 .005verf-desj.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
6 .01accesfunds.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
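If the goal is the CSV layout shown in the question, one more call writes it out (a minimal addition; 'domains.csv' is just a hypothetical file name):
df.to_csv('domains.csv', index=False)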
You can do all of this with the Python standard library.
HEADER = "domain_name | code"
# Open files
with open("input.txt") as f_in, open("output.txt", "w") as f_out:
# Write header
print(HEADER, file=f_out)
print("-" * len(HEADER), file=f_out)
# Parse file and output in correct format
code = None
for line in f_in:
if line.startswith("#"):
# Ignore comments
continue
if line.endswith("$"):
# Store line as the current "code"
code = line
else:
# Write these domain_name entries into the
# output file separated by ' | '
print(line, code, sep=" | ", file=f_out)

Slicing in python in a csv

I have a question about slicing in Python. I'm working with a CSV file and I want to get only the first value in the row that corresponds with another value, which the user will specify. For example, my CSV file looks like this:
| Date | Wind (mph) |
|------|------------|
| 20 | W 3 |
| 20 | W 3 |
| 20 | Vrbl 5 |
| 19 | Vrbl 7 |
| 19 | W 7 |
I want to get only the first wind direction value that corresponds with the date entered. From there, I want to get only the first letter. For example, if I requested the date of the 20th, I want wind = w. I think I need to slice the row, but I can't figure out where.
import csv
date = (raw_input("Please enter a date within the past three days (format: for 12/2/15, enter '02'): "))
with open('wind.csv', 'rb') as csvfile_wind:
    reader3 = csv.reader(csvfile_wind)
    for row in reader3:
        if(row[0]) == date:
            wind = (row[1])
            print wind
You don't really need to split it.
You could just do
if(row[0]) == date:
    wind = row[1][0]
    print wind
to print only the first character at index [0] of the string in row[1].
What is the delimiter in the csv file?
Anyway, I assume that
row[1].split(" ")[0]
would do the trick.
E.g.,
In [1]: "w 3".split(" ")[0]
Out[1]: 'w'
Your code is correct.
Please try adding some args: csv.reader(csvfile_wind, dialect='excel-tab', delimiter=';')
You can also print row as well.
If the row has been split, you will see something like ['20', 'some string']
And you don't need brackets:
if row[0] == date:
    wind = row[1]
    print wind
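Putting the pieces together, a hedged sketch that reads only the first matching row and takes just the first letter of the wind direction (assuming the same wind.csv layout and the question's Python 2 style):
import csv

date = raw_input("Please enter a date within the past three days (format: for 12/2/15, enter '02'): ")
with open('wind.csv', 'rb') as csvfile_wind:
    for row in csv.reader(csvfile_wind):
        if row[0] == date:
            print row[1].split()[0][0]  # first letter of the direction, e.g. 'W'
            break                       # stop after the first matching row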
