Transform a txt file to a pandas dataframe - python

Hi, I have the following txt file:
December
line: 285 - event ID: 67511
line: 296 - event ID: 67512
November
line: 305 - event ID: 67515
line: 300 - event ID: 67517
I want to transform it into the following DataFrame:
df1 = pd.DataFrame(
    {
        "index": ["December", "December", "November", "November"],
        "index1": ["285", "296", "305", "300"],
        "eventid": ["67511", "67512", "67515", "67517"],
    }
)
      index index1 eventid
0  December    285   67511
1  December    296   67512
2  November    305   67515
3  November    300   67517
Any ideas?

I have used pattern matching to achieve what you need:
import re
import pandas as pd

res = []
month_pattern = re.compile(r"^\w+$")  # a line containing only a month name
line_pattern = re.compile(r"\d+")     # all numbers in a line

current_month = ""
with open("FILE_PATH_TO_YOUR_DATA", "r") as f:
    for line in f:
        m = month_pattern.findall(line)
        if len(m) > 0:
            current_month = m[0]
        m = line_pattern.findall(line)
        if len(m) > 0:
            res.append([current_month] + m)

df = pd.DataFrame(res, columns=["index", "index1", "eventid"])
print(df)
OUTPUT

      index index1 eventid
0  December    285   67511
1  December    296   67512
2  November    305   67515
3  November    300   67517

Character specific conditional check in a string

I have to read and analyse some log files using Python; they usually contain strings in the following desired format:
date Don Dez 10 21:41:41.747 2020
base hex timestamps absolute
no internal events logged
// version 13.0.0
//28974.328957 previous log file: 21-41-41_Voltage.asc
// Measurement UUID: 9e0029d6-43a0-49e3-8708-3ec70363124c
28976.463987 LoggingString := "Log,5:45 AM, Friday, December 11, 2020,05:45:20.6,65.48,11.99,0.009843,12,0.01078,11.99,0.01114,11.99,0.01096,12,0.009984,4.595,0,1.035,0,0.1745,0,2,OM_2_1,0"
28978.600018 LoggingString := "Log,5:45 AM, Friday, December 11, 2020,05:45:22.7,65.47,11.99,0.009896,12,0.01079,11.99,0.01117,11.99,0.01097,12,0.009965,4.628,0,1.044,0,0.1698,0,2,OM_2_1,0"
However, sometimes files are created in undesired formats like the ones below:
date Die Jul 13 08:40:22.878 2021
base hex timestamps absolute
no internal events logged
// version 13.0.0
//1035.595166 previous log file: 08-40-22_Voltage.asc
// Measurement UUID: 2baf3f3f-300a-4f0a-bcbf-0ba5679d8be2
"1203.997816 LoggingString := ""Log" 9:01 am Tuesday July 13 2021 09:01:58.3 24.53 13.38 0.8948 13.37 0.8801 13.37 0.89 13.37 0.9099 13.47 0.8851 4.551 0.00115 0.8165 0 0.2207 0 5 OM_3_2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 "0"""
"1206.086064 LoggingString := ""Log" 9:02 am Tuesday July 13 2021 09:02:00.4 24.53 13.37 0.8945 13.37 0.8801 13.37 0.8902 13.37 0.9086 13.46 0.8849 5.142 0.001185 1.033 0 0.1897 0 5 OM_3_2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 "0"""
OR
date Mit Jun 16 10:11:43.493 2021
base hex timestamps absolute
no internal events logged
// version 13.0.0
// Measurement UUID: fe4a6a97-d907-4662-89f9-bd246aa54a33
10025.661597 LoggingString := """""""Log""" 12:59 PM Wednesday June 16 2021 12:59:01.1 66.14 0.00423 0 0.001206 0 0.001339 0 0.001229 0 0.001122 0 0.05017 0 0.01325 0 0.0643 0 0 OM_2_1_transition 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 """0"""""""
10030.592652 LoggingString := """""""Log""" 12:59 PM Wednesday June 16 2021 12:59:06.1 66.14 11.88 0.1447 11.88 0.1444 11.88 0.1442 11.87 0.005552 11.9 0.00404 2.55 0 0.4712 0 0.09924 0 0 OM_2_1_transition 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 """0"""""""
Since I am only concerned with the data below the "// Measurement UUID" line, I am using this code to extract data from strings in the desired format:
files = os.listdir(directory)
files = natsorted(files)
for file in files:
    base, ext = os.path.splitext(file)
    if file not in processed_files and ext == '.asc':
        print("File added:", file)
        file_path = os.path.join(directory, file)
        count = 0
        with open(file_path, 'r') as file_in:
            processed_files.append(file)
            Output_list = []  # Each string from the file is read into this list
            Final = []  # Required specific data from each string is isolated & stored here
            for line in map(str.strip, file_in):
                if "LoggingString" in line:
                    # index where '"' first appears in the line
                    first_quote = line.index('"')
                    # index of the next '"' after it (end of the quoted payload)
                    last_quote = line.index('"', first_quote + 1)
                    Output_list.append(
                        line[:first_quote].split(maxsplit=1)
                        + line[first_quote + 1: last_quote].split(","),
                    )
                    Final.append(Output_list[count][7:27])
The undesired formats contain one or more whitespace characters between values, as seen above. I guess the log file generator sometimes produces a non-comma-separated file, or a comma-separated file with errors; I am not sure.
I tried adding a condition after the check:
if "LoggingString" in line :
if ',' in line:
first_quote = line.index('"')
last_quote = line.index('"', first_quote + 1)
Output_list.append(line[:first_quote].split(maxsplit=1)
+ line[first_quote + 1: last_quote].split(","),)
Final.append(Output_list[count][7:27])
else:
raise Exception("Error in File")
However, this didn't serve the purpose: if an undesired format happens to contain even one ',' in the string, the program considers it valid and processes it, which produces false results.
How do I ensure that only files containing strings in the desired format are processed, and that an error message is printed for the others? What type of conditional check could be implemented here?
You can use pandas.read_csv with a regex separator:
import glob
import pandas as pd

l = []
for f in glob.glob("/tmp/Log*.txt"):
    df = (pd.read_csv(f, sep=r',|(?<=[\w"])\s+(?=[\w"])',
                      header=None, skiprows=6, engine="python").iloc[:, 2:28])
    df.insert(0, "filename", f.split("\\")[-1])
    l.append(df)

out = pd.concat(l)
Output: a single DataFrame concatenating all files, with a filename column followed by the parsed value columns (columns 2-27 of each log).
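If you also want the explicit error check asked for in the question, one option is to validate each LoggingString line against a regex for the desired comma-separated layout before processing it. A sketch, assuming only the "Log,..." payload format shown above counts as desired:

import re

# Desired lines look like: 12345.678 LoggingString := "Log,<comma-separated fields>"
DESIRED = re.compile(r'^\d+\.\d+\s+LoggingString\s+:=\s+"Log,[^"]+"$')

def check_line(line):
    """Return the quoted payload fields if the line is well formed, else raise."""
    line = line.strip()
    if not DESIRED.match(line):
        raise ValueError("Error in file: unexpected LoggingString format")
    first_quote = line.index('"')
    last_quote = line.rindex('"')
    return line[first_quote + 1:last_quote].split(",")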

How do I assign year&months in PD dataframe?

My pandas DataFrame looks very weird after running the code. The data doesn't come with a year/month variable, so I have to add them manually. Is there a way I could do that?
import json
import requests
import pandas as pd

sample = []
url1 = "https://api.census.gov/data/2018/cps/basic/jan?get=PEFNTVTY,PEMNTVTY&for=state:01&PEEDUCA=39&key=YOUR_KEY_GOES_HERE"
url2 = "https://api.census.gov/data/2018/cps/basic/feb?get=PEFNTVTY,PEMNTVTY&for=state:01&PEEDUCA=39&key=YOUR_KEY_GOES_HERE"
url3 = "https://api.census.gov/data/2018/cps/basic/mar?get=PEFNTVTY,PEMNTVTY&for=state:01&PEEDUCA=39&key=YOUR_KEY_GOES_HERE"
sample.append(requests.get(url1).text)
sample.append(requests.get(url2).text)
sample.append(requests.get(url3).text)

sample = [json.loads(i) for i in sample]
sample = pd.DataFrame(sample)
sample
Consider read_json to read the Census API URL directly inside a user-defined method. Then iterate through all pairs of years and months with itertools.product, building a data frame for each pair and assigning the corresponding year/month columns:
import pandas as pd
import calendar
import itertools

def get_census_data(year, month):
    # BUILD DYNAMIC URL
    url = (
        f"https://api.census.gov/data/{year}/cps/basic/{month.lower()}?"
        "get=PEFNTVTY,PEMNTVTY&for=state:01"
    )
    # CLEAN RAW DATA FOR APPROPRIATE ROWS AND COLS, ASSIGN YEAR/MONTH COLS
    raw_df = pd.read_json(url)
    cps_df = (
        pd.DataFrame(raw_df.iloc[1:, ])
        .set_axis(raw_df.iloc[0, ], axis="columns", inplace=False)
        .assign(year=year, month=month)
    )
    return cps_df

# MONTH AND YEAR LISTS
months_years = itertools.product(
    range(2010, 2021),
    calendar.month_abbr[1:13]
)

# ITERATE PAIRWISE THROUGH LISTS
cps_list = [get_census_data(yr, mo) for yr, mo in months_years]

# COMPILE AND CLEAN FINAL DATA FRAME
cps_df = (
    pd.concat(cps_list, ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
    .rename_axis(None, axis="columns")
)
Output

cps_df

     PEFNTVTY PEMNTVTY state  year month
0          57       57     1  2010   Jan
1         303      303     1  2010   Jan
2         233      233     1  2010   Jan
3          57      233     1  2010   Jan
4          73       73     1  2010   Jan
...       ...      ...   ...   ...   ...
6447      210      139     1  2020   Dec
6448      363      363     1  2020   Dec
6449      301       57     1  2020   Dec
6450       57      242     1  2020   Dec
6451      416      416     1  2020   Dec

[6452 rows x 5 columns]
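One caveat if you run this on a recent pandas: the inplace argument of set_axis was deprecated in pandas 1.5 and removed in 2.0, so on current versions the call inside get_census_data would be written without it:

# pandas >= 2.0: set_axis no longer accepts inplace
cps_df = (
    pd.DataFrame(raw_df.iloc[1:, ])
    .set_axis(raw_df.iloc[0, ], axis="columns")
    .assign(year=year, month=month)
)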
The response to each API call is a JSON array of arrays. You called the wrong DataFrame constructor. Try this:
import requests
import pandas as pd

base_url = "https://api.census.gov/data/2018/cps/basic"
params = {
    "get": "PEFNTVTY,PEMNTVTY",
    "for": "state:01",
    "PEEDUCA": 39,
}

df = []
for month in ["jan", "feb", "mar"]:
    r = requests.get(f"{base_url}/{month}", params=params)
    r.raise_for_status()
    j = r.json()
    df.append(pd.DataFrame.from_records(j[1:], columns=j[0]).assign(month=month))
df = pd.concat(df)
Result:

  PEFNTVTY PEMNTVTY PEEDUCA state month
0       57       57      39     1   jan
1       57       57      39     1   jan
2       57       57      39     1   jan
3       57       57      39     1   jan
4       57       57      39     1   jan
...
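For reference, each response body is a JSON array of arrays whose first inner array holds the column names, which is why the code splits it into j[0] (header) and j[1:] (records). An illustrative shape (values abridged, not real data):

j = [
    ["PEFNTVTY", "PEMNTVTY", "PEEDUCA", "state"],  # j[0]: column names
    ["57", "57", "39", "01"],                      # j[1:]: data rows
    ["57", "57", "39", "01"],
]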

write rows in pandas dataframe and append it to existing dataframe

My script outputs, for each term, the year and the count of that word in articles from that year:
abcd
2013
118
2014
23
xyz
2013
1
2014
45
I want each year added as a new column to my existing dataframe, which contains only the words.
Expected output:
Terms  2013  2014  2015
abc     118    76    90
xyz      23     0    36
The input for my script was a csv file:
Terms
xyz
abc
efg
The script I wrote is:
import urllib.request
import xml.etree.ElementTree as ET
import pandas as pd

df = pd.read_csv('a.csv', header=None)
for row in df.itertuples():
    term = str(row[1])
    u = "http: term=%s&mindate=%d/01/01&maxdate=%d/12/31"
    print(term)
    startYear = 2013
    endYear = 2018
    for year in range(startYear, endYear + 1):
        url = u % (term.replace(" ", "+"), year, year)
        page = urllib.request.urlopen(url).read()
        doc = ET.XML(page)
        count = doc.find("Count").text
        print(year)
        print(count)
The df.head() output is:
                         0
0           1,2,3-triazole
1  16s rrna gene amplicons
Any help will be greatly appreciated. Thanks in advance!
I would read the csv with numpy into an array, reshape it with numpy, and then turn the resulting matrix/2D array into a DataFrame.
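A minimal sketch of that numpy idea, assuming the file has one token per line and every term block has exactly the shape of the sample (a term followed by two year/count pairs; terms containing spaces would need a different reader):

import numpy as np
import pandas as pd

flat = np.loadtxt("data.txt", dtype=str)  # one token per line -> 1D string array
blocks = flat.reshape(-1, 5)              # rows: [term, year1, count1, year2, count2]
df = pd.DataFrame({
    "Terms": blocks[:, 0],
    blocks[0, 1]: blocks[:, 2].astype(int),  # e.g. the "2013" column
    blocks[0, 3]: blocks[:, 4].astype(int),  # e.g. the "2014" column
})
print(df)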
Something like this should do it:
#!/usr/bin/env python

def mkdf(filename):
    def combine(term, l):
        # pair up [year, count, year, count, ...] into {year: count}
        d = {"term": term}
        d.update(dict(zip(l[::2], l[1::2])))
        return d

    term = None
    other = []
    with open(filename) as I:
        n = 0
        for line in I:
            line = line.strip()
            try:
                int(line)
            except ValueError:
                # not an int, so this line is a term
                if term:  # if we already have one, emit its record
                    yield combine(term, other)
                term = line
                other = []
                n = 0
            else:
                if n > 0:
                    other.append(line)
            n += 1
    # and the last one
    yield combine(term, other)

if __name__ == "__main__":
    import pandas as pd
    import sys

    df = pd.DataFrame([r for r in mkdf(sys.argv[1])])
    print(df)
Usage: python scriptname.py /tmp/IN (or another file with your data).
Output:
  2013 2014  term
0  118   23  abcd
1    1   45   xyz
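If you collect the counts in long form (one row per term and year) while querying, a pandas pivot then produces the wide layout the question asks for. A sketch with the sample values:

import pandas as pd

# long form: one row per (term, year) pair, e.g. appended inside the query loop
long_df = pd.DataFrame({
    "Terms": ["abcd", "abcd", "xyz", "xyz"],
    "year":  [2013, 2014, 2013, 2014],
    "count": [118, 23, 1, 45],
})

# wide form: one column per year
wide = long_df.pivot(index="Terms", columns="year", values="count").reset_index()
print(wide)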

Compare some columns from some tables using python

I need to compare two values, MC and JT, from 2 tables:
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -8.25705 0.219113 -0.000800014 20.8926 41.4347 5.75852 0 4.13067 0
1 423 17950 18150 210 180 17400 18430 1 0 -4.26426 0.586578 -0.053 77.22 85.2104 22.0534 0 3.551 0
2 468 41790 42020 240 50 41360 42380 0 0 7.82681 0.181248 -0.00269566 90.0646 92.7698 5.0841 0 4.19304 0
and
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -0.846655 0.0218695 2.59898e-05 2.0724 4.1259 0.583259 10 0.412513 0
1 423 17950 18150 210 180 17400 18780 1 0 -0.453311 0.058732 -0.00526783 7.7403 8.52544 2.19627 0 0.354126 0
2 468 41790 42020 240 70 41360 42380 0 0 0.743716 0.0181613 -0.000256186 9.08777 9.21395 0.502506 0 0.419265 0
I need to do it using the csv module. I know how to do it using pandas and xlrd, but not with csv.
Desired output:
Number_of_strings MC JT
and print the rows where the values differ. My attempt so far:
import csv

old = csv.reader(open('old.csv', 'rb'), delimiter=',')
new = csv.reader(open('new.csv', 'rb'), delimiter=',')
for row1, row2 in zip(old, new):
    if (row1[8] == row2[8]) and (row1[9] == row2[9]):
        continue
    else:
        print row1[0] + ':' + row1[8] + '!=' + row2[8]
You can try something like the following:
old = list(csv.reader(open('old.csv', 'rb'), delimiter=','))
new = list(csv.reader(open('new.csv', 'rb'), delimiter=','))

old = zip(*old)  # transpose: old[i] is now column i
new = zip(*new)

print ['%s-%s-%s' % (str(a), str(b), str(c))
       for a, b, c in zip(old[0], new[8], old[8]) if b != c]
First, we get a list of lists; zip(*x) transposes it, so old[0] is the EID column and old[8]/new[8] are the MC columns. The rest should be easy to decipher.
You can put whatever you want within the format string.
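A variant that checks both the MC (column 8) and JT (column 9) values row by row and reports the row number, still using only the csv module (a sketch; column positions taken from the header above, Python 3 I/O):

import csv

with open('old.csv', newline='') as f_old, open('new.csv', newline='') as f_new:
    old_rows = list(csv.reader(f_old))
    new_rows = list(csv.reader(f_new))

header = old_rows[0]
for i, (r_old, r_new) in enumerate(zip(old_rows[1:], new_rows[1:]), start=1):
    for col in (8, 9):  # MC and JT
        if r_old[col] != r_new[col]:
            print('row %d: %s %s != %s' % (i, header[col], r_old[col], r_new[col]))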

Group rows in a CSV by blocks of 25

I have a csv file with 2 columns, representing a distribution of items per year which looks like this:
A B
1900 10
1901 2
1903 5
1908 8
1910 25
1925 3
1926 4
1928 1
1950 10
etc, about 15000 lines.
When making a distribution diagram based on this data, there are too many points on the axis, which is not very pretty. I want to group rows into blocks of 25 years, so that in the end I have fewer points on the axis.
So, for example, from 1900 till 1925 I would have the sum of produced items as one row in column A and one row in column B:
1925 53
1950 15
So far I have only figured out how to convert the data in the csv file to int:
o = open('/dates_dist.csv', 'rU')
mydata = csv.reader(o)

def int_wrapper(mydata):
    for v in mydata:
        yield map(int, v)

reader = int_wrapper(mydata)
I can't figure out how to go further...
You could use itertools.groupby:
import itertools as IT
import csv

def int_wrapper(mydata):
    for v in mydata:
        yield map(int, v)

with open('data', 'rU') as o:
    mydata = csv.reader(o)
    header = next(mydata)
    reader = int_wrapper(mydata)
    for key, group in IT.groupby(reader, lambda row: (row[0] - 1) // 25 + 1):
        year = key * 25
        total = sum(row[1] for row in group)
        print(year, total)
yields
(1900, 10)
(1925, 43)
(1950, 15)
Note that 1900 to 1925 (inclusive) spans 26 years, not 25, so if you want to group 25 years, given the way you are reporting the totals, you probably want the half-open interval (1900, 1925].
The expression row[0]//25 integer-divides the year by 25; the result is the same for all years in the half-open range [1900, 1925).
To make the range half-open on the left instead, subtract 1 before dividing and add 1 afterwards: (row[0]-1)//25+1.
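A quick check of that boundary arithmetic on the years from the sample:

for year in (1900, 1901, 1925, 1926, 1950):
    group = (year - 1) // 25 + 1
    print(year, group, group * 25)
# 1900 -> group 76, reported as 1900
# 1901 -> group 77, reported as 1925
# 1925 -> group 77, reported as 1925
# 1926 -> group 78, reported as 1950
# 1950 -> group 78, reported as 1950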
Here is my approach. It's definitely not the most elegant Python code, but it could be a way to achieve the desired output.
if __name__ == '__main__':
    o = open('dates_dist.csv', 'rU')
    lines = o.read().split("\n")  # Create a list having each line of the file
    out_dict = {}
    curr_date = 0
    curr_count = 0
    chunk_sz = 25  # years
    if len(lines) > 0:
        line_split = lines[0].split(",")
        start_year = int(line_split[0])
        curr_count = 0
    # Iterate over each line of the file
    for line in lines:
        # Split at comma to get the year and the count.
        # line_split[0] will be the year and line_split[1] will be the count.
        line_split = line.split(",")
        curr_year = int(line_split[0])
        time_delta = curr_year - start_year
        if time_delta <= chunk_sz:
            curr_count = curr_count + int(line_split[1])
        else:
            out_dict[start_year + chunk_sz] = curr_count
            start_year = start_year + chunk_sz
            curr_count = int(line_split[1])
        # print curr_year, curr_count
    out_dict[start_year + chunk_sz] = curr_count
    print out_dict
You could create a dummy column and group by it after doing some integer division:
>>> df['temp'] = df['A'] // 25
>>> df
      A   B  temp
0  1900  10    76
1  1901   2    76
2  1903   5    76
3  1908   8    76
4  1910  25    76
5  1925   3    77
6  1926   4    77
7  1928   1    77
8  1950  10    78
>>> df.groupby('temp').sum()
         A   B
temp
76    9522  50
77    5779   8
78    1950  10
My numbers are slightly different from yours since I am technically grouping from 1900-1924, 1925-1949, and 1950-1974, but the idea is the same.
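If you want the labels from the groupby answer above instead (half-open blocks reported as 1900, 1925, 1950), a small variant of the same dummy-column trick reproduces them; a sketch using the same df:

# label each row by the end of its 25-year block, matching the (1900, 1925] convention
df['block_end'] = (df['A'] - 1) // 25 * 25 + 25
print(df.groupby('block_end')['B'].sum())
# block_end
# 1900    10
# 1925    43
# 1950    15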
