Extracting data from columns in a text file using Python

I am new to Python file data processing. I have the following text file containing the report of a new college campus. I want to extract the data from the "colleges" column and the "book IDs_1" value for block_ABC_top, which is 23. I also want to know if there are any more occurrences of block_ABC_top in the colleges column and find the value in the book IDs_1 column for each of them.
Is this possible with a text file, or will I have to change it to CSV? How do I write code for this data processing? Kindly help me!
Copyright 1986-2019, Inc. All Rights Reserved.
Design Information
-----------------------------------------------------------------------------------------------------------------
| Version : (lin64) Build 2729669 Thu Dec 5 04:48:12 MST 2019
| Date : Wed Aug 26 00:46:08 2020
| Host : running 64-bit Red Hat Enterprise Linux Server release 7.8
| Command : college report
| Design : college
| Device : laptop
| Design State : in construction
-----------------------------------------------------------------------------------------------------------------
Table of Contents
-----------------
1. Information by Hierarchy
1. Information by Hierarchy
---------------------------
+----------------------------------------------+--------------------------------------------+------------+------------+---------+------+-----+
| colleges | Module | Total mems | book IDs_1 | canteen | BUS | UPS |
+----------------------------------------------+--------------------------------------------+------------+------------+---------+------+-----+
| block_ABC_top | (top) | 44 | 23 | 8 | 8 | 8 |
| (block_ABC_top_0) | block_ABC_top_0 | 5 | 5 | 5 | 2 | 9 |
+----------------------------------------------+--------------------------------------------+------------+------------+---------+------+-----+
I have a data list containing colleges such as block_ABC_top, block_ABC_top_1, block_ABC_top, block_ABC_top_1... Here is my code below.
The problem I face is that it only takes the data for data[0], but data[0] and data[2] hold the same college and I expect the check to happen twice.
with open("utility.txt", 'r') as f1:
    for line in f1:
        if data[x] in line:
            line_values = line.split('|')
            if (int(line_values[4]) == 23 or int(line_values[7]) == 8):
                filecheck = fullpath + "/" + filenames[x]
                print filecheck
                #print "check file " + filenames[x]
            x = x + 1
f1.close()

print [line.split('|')[1].strip() for line in open('utility.txt') if line.startswith('|')]  # colleges column
print [line.split('|')[4].strip() for line in open('utility.txt') if line.startswith('|')]  # book IDs_1 column
Try running these.

Instead of going by the exact position of each field, a better way would be to use the split() function, since you have your fields separated by a | symbol. You can loop through the lines of the file and handle them accordingly.
with open("utility.txt") as f:
    for line in f:
        if not line.startswith("|"):
            continue  # skip the +---+ border lines
        line_values = line.split("|")
        print(line_values[1].strip())  # e.g. block_ABC_top

To extract the book ID column data, use the code below:
with open('report.txt') as f:
    for line in f:
        if 'block_ABC_top' in line:
            line_values = line.split('|')
            print(line_values[4])  # prints 23 and 5
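If you also need the check to run once per entry of your data list, so that repeated colleges like data[0] and data[2] are each handled, a minimal sketch is to read the file once and loop over the list (data, fullpath and filenames here are the variables from the question):

with open("utility.txt") as f1:
    lines = f1.readlines()

# one pass over the cached lines per entry in `data`, so duplicates
# in the list (data[0] and data[2]) each trigger their own check
for x, college in enumerate(data):
    for line in lines:
        if college in line and line.startswith('|'):
            line_values = line.split('|')
            if int(line_values[4]) == 23 or int(line_values[7]) == 8:
                print(fullpath + "/" + filenames[x])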


Pandas not displaying all columns when writing to CSV

I am attempting to export a dataset that looks like this:
+----------------+--------------+--------------+--------------+
| Province_State | Admin2 | 03/28/2020 | 03/29/2020 |
+----------------+--------------+--------------+--------------+
| South Dakota | Aurora | 1 | 2 |
| South Dakota | Beedle | 1 | 3 |
+----------------+--------------+--------------+--------------+
However, the actual CSV file I am getting is like so:
+-----------------+--------------+--------------+
| Province_State | 03/28/2020 | 03/29/2020 |
+-----------------+--------------+--------------+
| South Dakota | 1 | 2 |
| South Dakota | 1 | 3 |
+-----------------+--------------+--------------+
Using this code (runnable by calling createCSV(); it pulls data from the COVID government GitHub):
import csv          # csv reader
import pandas as pd # csv parser
import collections  # not needed
import requests     # retrieves the CSV from the gov data repo

def getFile():
    url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv'
    response = requests.get(url)
    print('Writing file...')
    open('us_deaths.csv', 'wb').write(response.content)

# takes raw data from the link, creates a CSV for each unique state and removes unneeded headings
def createCSV():
    getFile()
    # init data
    data = pd.read_csv('us_deaths.csv', delimiter=',')
    # drop extra columns
    data.drop(['UID'], axis=1, inplace=True)
    data.drop(['iso2'], axis=1, inplace=True)
    data.drop(['iso3'], axis=1, inplace=True)
    data.drop(['code3'], axis=1, inplace=True)
    data.drop(['FIPS'], axis=1, inplace=True)
    #data.drop(['Admin2'], axis=1, inplace=True)
    data.drop(['Country_Region'], axis=1, inplace=True)
    data.drop(['Lat'], axis=1, inplace=True)
    data.drop(['Long_'], axis=1, inplace=True)
    data.drop(['Combined_Key'], axis=1, inplace=True)
    #data.drop(['Province_State'], axis=1, inplace=True)
    data.to_csv('DEBUGDATA2.csv')
    # sets Province_State as primary key; searches based on date and key to create new CSVs in the root directory of the app
    data = data.set_index('Province_State')
    data = data.iloc[:, 2:].rename(columns=pd.to_datetime, errors='ignore')
    for name, g in data.groupby(level='Province_State'):
        g[pd.date_range('03/23/2020', '03/29/20')] \
            .to_csv('{0}_confirmed_deaths.csv'.format(name))
The reason for the loop is to set the date columns (everything after the first two) to a date, so that I can select only from 03/23/2020 and beyond. If anyone has a better method of doing this, I would love to know.
To ensure it works, it prints out all the field names, including Admin2 (county name), Province_State, and the rest of the dates.
However, in my CSV as you can see, Admin2 seems to have disappeared. I am not sure how to make this work, if anyone has any ideas that'd be great!
Changed
data = data.set_index('Province_State')
to
data = data.set_index(['Province_State', 'Admin2'])
Needed to create a multi-index key to allow the Admin2 column to show. Any smoother tips on the date-range section are welcome.
Thanks for the help all!
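On the date-range point, here is a possible simplification (a sketch, not the poster's method: it assumes the us_deaths.csv from above and filters the date columns by pattern, so any leftover non-date columns such as Population are dropped automatically):

import pandas as pd

data = pd.read_csv('us_deaths.csv')
data = data.set_index(['Province_State', 'Admin2'])
# keep only columns whose labels look like dates, then convert the labels
date_cols = data.columns[data.columns.str.match(r'\d+/\d+/\d+')]
data = data[date_cols]
data.columns = pd.to_datetime(data.columns)
# a label-based column slice replaces the per-group pd.date_range lookup
recent = data.loc[:, '2020-03-23':'2020-03-29']
for name, g in recent.groupby(level='Province_State'):
    g.to_csv('{0}_confirmed_deaths.csv'.format(name))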

String manipulation, data wrangling, regex

I have a .txt file of 3 million rows. The file contains data that looks like this:
# RSYNC: 0 1 1 0 512 0
#$SOA 5m localhost. hostmaster.localhost. 1906022338 1h 10m 5d 1s
# random_number_ofspaces_before_this text $TTL 60s
#more random information
:127.0.1.2:https://www.spamhaus.org/query/domain/$
test
:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com
.0-1-hub.com
.zzzy1129.cn
:127.0.1.4:https://www.spamhaus.org/query/domain/$
.0-il.ml
.005verf-desj.com
.01accesfunds.com
In the above data, there is a code associated with all domains listed beneath it.
I want to turn the above data into a format that can be loaded into HiveQL/SQL. The HiveQL table should look like this:
+--------------------+--------------+-------------+-----------------------------------------------------+
| domain_name | period_count | parsed_code | raw_code |
+--------------------+--------------+-------------+-----------------------------------------------------+
| test | 0 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-0m5tk.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-1-hub.com | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .zzzy1129.cn | 2 | 127.0.1.2 | :127.0.1.2:https://www.spamhaus.org/query/domain/$ |
| .0-il.ml | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .005verf-desj.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
| .01accesfunds.com | 2 | 127.0.1.4 | :127.0.1.4:https://www.spamhaus.org/query/domain/$ |
+--------------------+--------------+-------------+-----------------------------------------------------+
Please note that I do not want the vertical bars in any output; they are just to make the above look like a table.
I'm guessing that creating a HiveQL table like the above will involve converting the .txt into a .csv or a Pandas data frame. If creating a .csv, then the .csv would probably look like:
domain_name,period_count,parsed_code,raw_code
test,0,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-0m5tk.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-1-hub.com,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.zzzy1129.cn,2,127.0.1.2,:127.0.1.2:https://www.spamhaus.org/query/domain/$
.0-il.ml,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.005verf-desj.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
.01accesfunds.com,2,127.0.1.4,:127.0.1.4:https://www.spamhaus.org/query/domain/$
I'd be interested in a Python solution, but lack familiarity with the packages and functions necessary to complete the above data wrangling steps. I'm looking for a complete solution, or code tidbits to construct my own solution. I'm guessing regular expressions will be needed to identify the "category" or "code" line in the raw data. They always start with ":127.0.1." I'd also like to parse the code out to create a parsed_code column, and a period_count column that counts the number of periods in the domain_name string. For testing purposes, please create a .txt of the sample data I have provided at the beginning of this post.
Regardless of how you want to format it in the end, I suppose the first step is to separate the domain_name and code. That part is pure Python:
rows = []
code = None
parsed_code = None
with open('input.txt', 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith(':127'):
            code = line
            parsed_code = line.split(':')[1]
            continue
        if line.startswith('#'):
            continue
        period_count = line.count('.')
        rows.append((line, period_count, parsed_code, code))
Just for illustration, you can use pandas to format the data nicely as a table, which might help if you want to pipe this to SQL, but it's not absolutely necessary. Post-processing of strings is also quite straightforward in pandas.
import pandas as pd

df = pd.DataFrame(rows, columns=['domain_name', 'period_count', 'parsed_code', 'raw_code'])
print(df)
prints this:
domain_name period_count parsed_code raw_code
0 test 0 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
1 .0-0m5tk.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
2 .0-1-hub.com 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
3 .zzzy1129.cn 2 127.0.1.2 :127.0.1.2:https://www.spamhaus.org/query/doma...
4 .0-il.ml 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
5 .005verf-desj.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
6 .01accesfunds.com 2 127.0.1.4 :127.0.1.4:https://www.spamhaus.org/query/doma...
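If the end goal is the comma-separated file shown in the question, the DataFrame route finishes in one line (reusing the df built above; the output filename is arbitrary):

# index=False keeps pandas' row numbers out of the file, matching the
# desired CSV layout from the question
df.to_csv('spamhaus.csv', index=False)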
You can do all of this with the Python standard library.
HEADER = "domain_name | code"

# Open input and output files
with open("input.txt") as f_in, open("output.txt", "w") as f_out:
    # Write header
    print(HEADER, file=f_out)
    print("-" * len(HEADER), file=f_out)
    # Parse the file and output in the correct format
    code = None
    for line in f_in:
        line = line.rstrip("\n")  # strip the newline so endswith("$") can match
        if line.startswith("#"):
            # Ignore comments
            continue
        if line.endswith("$"):
            # Store line as the current "code"
            code = line
        else:
            # Write these domain_name entries into the
            # output file separated by ' | '
            print(line, code, sep=" | ", file=f_out)
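If you want the four-column CSV from the question rather than the pipe-separated layout, the same single pass works with the csv module (a sketch; the output filename is arbitrary):

import csv

with open("input.txt") as f_in, open("spamhaus.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["domain_name", "period_count", "parsed_code", "raw_code"])
    code = None
    for line in f_in:
        line = line.rstrip("\n")
        if line.startswith("#") or not line:
            continue  # skip comments and blank lines
        if line.endswith("$"):
            code = line  # remember the current code line
            continue
        # a domain line: count its periods and attach the current code
        writer.writerow([line, line.count("."), code.split(":")[1], code])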

How to split a csv file row to columns in python?

Sorry, I am new to Python. I have a CSV file that gets data from Google Trends written to it. However, the output is all written to the same column. I want the date in column A, Bitcoin in column B, Cryptocurrency in column C, and so on. I am really struggling with this simple task. Can anyone help, please? Thanks.
Below is the sample of the csv file.
"date Bitcoin Cryptocurrency Crypto isPartial"
"2013-10-27 5 0 0 False"
"2013-11-03 5 0 0 False"
"2013-11-10 5 0 0 False"
"2013-11-17 12 0 0 False"
"2013-11-24 14 0 0 False"
"2013-12-01 13 0 0 False"
This is my code to generate the file
# login
pytrend = TrendReq(google_username, google_password)
pytrend = TrendReq()
# payload
pytrend.build_payload(kw_list=['Bitcoin', 'Cryptocurrency', 'Crypto'])
# interest over time
interest_over_time_df = pytrend.interest_over_time()
df = pd.DataFrame(interest_over_time_df)
file_name = "/Users/username/Desktop/Bitcoin.csv"
df.to_csv(file_name, sep='\t')
Here you go. You will need pandas to load it into a DataFrame.
import pandas as pd
dataframe = pd.read_csv('Bitcoin.csv', delimiter=r"\s+")
print(dataframe)
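The root cause is the sep='\t' in the question's to_csv call: spreadsheet tools split on commas by default. A sketch of the round-trip fix (Bitcoin_fixed.csv is an arbitrary output name):

import pandas as pd

# read the tab/whitespace-separated file back in, then write it out with
# the default comma separator so each field lands in its own column
dataframe = pd.read_csv('Bitcoin.csv', delimiter=r"\s+")
dataframe.to_csv('Bitcoin_fixed.csv', index=False)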
First of all, take a look at the CSV documentation for Python; it should give you all the info and examples you need.
Then, as I understand it, you want to write your rows as CSV separated by tabs, so something like this should work for you:
import csv

# open the target file (the name is arbitrary)
with open('output.csv', 'w', newline='') as csvfile:
    # First you create a csv.writer
    spamwriter = csv.writer(csvfile, delimiter='\t')
    # You write a row as a list into the csv.writer
    spamwriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])
I was able to find a few ideas from other posts. Option 1 simply uses formatting to make the output look nice, while Option 2 uses PrettyTable to give a nicely formatted answer. You can find the PrettyTable documentation here.
Option 1 comes from a previous post. All you have to do is play around with the numbers so that the spacing looks good enough to make you happy, and of course change the file name to match your CSV file.
Option 1
You could use format to left-justify your output. For example,

import csv

f = open("contactlist.csv")
csv_f = csv.reader(f)
for row in csv_f:
    print('{:<15} {:<15} {:<20} {:<25}'.format(*row))
Output:
Name            Phone           Company              Email
Elon Musk       454-6723        SpaceX               emusk@spacex.com
Larry Page      853-0653        Google               lpage@gmail.com
Tim Cook        133-0419        Apple                tcook@apple.com
Steve Ballmer   456-7893        Developers!          sballmer@bluescreen.com
You can read more about format here. The < symbol left-aligns the text, and the number specifies the width of the field. Each {} can include a positional argument before the colon :; if they are omitted, the strings appear in the order of the arguments in the unpacked list row.
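For instance, a quick illustration of the positional form:

# 0 and 1 pick arguments by position; :<10 left-aligns in a 10-character field
print('{1:<10} {0:<10}'.format('first', 'second'))  # -> 'second     first     '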
Option 2
For Option 2, I found this information here: Python Pretty Table.
That page gives you a multitude of ways to solve this problem, including a very simple one: the from_csv() function, which can be imported from PrettyTable with from prettytable import from_csv. Look at the example below for better insight.
Example:
Data.csv
"City name", "Area", "Population", "Annual Rainfall"
"Adelaide", 1295, 1158259, 600.5
"Brisbane", 5905, 1857594, 1146.4
"Darwin", 112, 120900, 1714.7
"Hobart", 1357, 205556, 619.5
"Sydney", 2058, 4336374, 1214.8
"Melbourne", 1566, 3806092, 646.9
"Perth", 5386, 1554769, 869.4
Python Code:
#!/usr/bin/python3
from prettytable import from_csv

with open("data.csv", "r") as fp:
    x = from_csv(fp)
print(x)
Output will look something like the following:
+-----------+------+------------+-----------------+
| City name | Area | Population | Annual Rainfall |
+-----------+------+------------+-----------------+
| Adelaide | 1295 | 1158259 | 600.5 |
| Brisbane | 5905 | 1857594 | 1146.4 |
| Darwin | 112 | 120900 | 1714.7 |
| Hobart | 1357 | 205556 | 619.5 |
| Sydney | 2058 | 4336374 | 1214.8 |
| Melbourne | 1566 | 3806092 | 646.9 |
| Perth | 5386 | 1554769 | 869.4 |
+-----------+------+------------+-----------------+
Please let me know if this was beneficial by leaving a comment or casting a vote, thank you!

Slicing in python in a csv

I have a question about slicing in python. I'm working with a csv file and I want to get only the first value in row that corresponds with another value, which the user will specify. For example, my csv file looks like this:
| Date | Wind (mph) |
|------|------------|
| 20 | W 3 |
| 20 | W 3 |
| 20 | Vrbl 5 |
| 19 | Vrbl 7 |
| 19 | W 7 |
I want to get only the first wind direction value that corresponds with the date entered. From there, I want to get only the first letter. For example, if I requested the date of the 20th, I want wind = w. I think I need to slice the row, but I can't figure out where.
import csv

date = (raw_input("Please enter a date within the past three days (format: for 12/2/15, enter '02'): "))
with open('wind.csv', 'rb') as csvfile_wind:
    reader3 = csv.reader(csvfile_wind)
    for row in reader3:
        if(row[0]) == date:
            wind = (row[1])
            print wind
You don't really need to split it.
You could just do
if(row[0]) == date:
    wind = row[1][0]
    print wind
to print only the first character, at index [0], of the string in row[1].
What is the delimiter in the csv file?
Anyway, I assume that
row[1].split(" ")[0]
would do the trick.
E.g.,
In [1]: "w 3".split(" ")[0]
Out[1]: 'w'
Your code is correct.
Try adding some arguments: csv.reader(csvfile_wind, dialect='excel-tab', delimiter=';').
You can also print row itself. If the row has been split, you will see something like [20, 'some string'].
And you don't need brackets:
if row[0] == date:
    wind = row[1]
    print wind
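Putting the pieces together, a Python 3 sketch (assuming the wind.csv layout from the question) that stops at the first matching row:

import csv

date = input("Please enter a date within the past three days: ")
with open('wind.csv', newline='') as csvfile_wind:
    for row in csv.reader(csvfile_wind):
        if row[0] == date:
            print(row[1][0])  # first letter of the wind field, e.g. 'W'
            break             # only the first matching row is wanted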

Import series-like data file into pandas

Here is an example of the data file:
=====
name aaa
place paaa
date Thu Oct 1 12:02:03 2015
load_status 198
add_name naaa
[---blank line---]
=====
name bbb
place pbbb
date Thu Oct 3 21:20:36 2015
load_status 2000.327
add_name nbbb
[---blank line---]
In one file there might be hundreds of records like that.
I would like to get a pandas object looking like this:
name | place | date | load_status | add_name
---------------------------------------------------------------
aaa | paaa | Thu Oct 1 12:02:03 2015 | 198 | naaa
bbb | pbbb | Thu Oct 3 21:20:36 2015 | 2000.327 | nbbb
The number of fields in each record is the same, so every record has 'name', 'place', etc.
I can transpose the file with bash+grep+awk and then read it as CSV, but that's not practical for users who have only Python and Windows.
Transposing the file using Python and then reading it as CSV looks like overkill, as I expect pandas should be able to handle this case somehow.
I thought of Series+dtypes and read_table, but couldn't make them work for me.
Here's a simple loop in Python. You'll have to do some cleaning and checking afterwards, but this should get you started.
import pandas as pd

records = []
this_record = {}
with open(input_fn, 'r') as f:
    for line in f:
        if line.strip() == '':
            # a blank line ends the current record
            records.append(this_record)
            this_record = {}
            continue
        elif line.startswith('='):
            continue
        line = line.split()
        this_record[line[0]] = ' '.join(line[1:]).strip()
if this_record:
    records.append(this_record)  # keep the last record if the file has no trailing blank line
df = pd.DataFrame.from_records(records)
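As a sketch of the "cleaning afterwards" step (reusing df from above; the column names are taken from the sample file), the numeric and date fields can be cast to proper dtypes:

# convert the string columns produced by the loop into real dtypes
df['load_status'] = pd.to_numeric(df['load_status'])
df['date'] = pd.to_datetime(df['date'])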
