I am trying to remove headers from data I am reading in with pandas. One file has a header and the other doesn't, but I want to be able to check for a header and then remove it.
So far, I have tried using header=None in the read_csv function:
from pathlib import Path
import pandas as pd
def _reader(fname):
    return pd.read_csv(fname, sep="\t", header=None)

folder = Path("C:\\Me\\Project1")
data = pd.concat([
    _reader(txt)
    for txt in folder.glob("*.txt")
])
I get the following error:
TypeError: must be str, not int
My two files look like this:
File1.txt
ISIN AVL_QTY
BAD 90000
AAB 8550000
BAD 173688
BAD 360000
BAD 90000
BAD 810000
BAD 900000
BAD 900000
File2.txt
TEST 543
HELLO 555
STOCK 900
CODE 785
First, you need to check whether the first line is a header. For example, you can check whether any entry in the first row begins with a digit, as that would be unusual for a column header.
Without knowing your thousands of files, any header-detection rule is really just a guess - but that's not the main issue in your code.
To make use of header detection, use a normal loop instead of a list comprehension, so that in each iteration you can: 1. check for a header, 2. read the file and append the data to a dataframe:
df = pd.DataFrame()
for f in folder.glob("*.txt"):
    with open(f) as fin:
        chk_lst = next(fin).split()
    is_h = not any(v[0].isdecimal() for v in chk_lst)
    df = pd.concat([df, pd.read_csv(f, sep=r'\s+', header=(None, 0)[is_h])], axis=1)
# ISIN AVL_QTY 0 1
# 0 BAD 90000 TEST 543.775
# 1 AAB 8550000 HELLO 555.000
# 2 BAD 173688 STOCK 900.000
# 3 BAD 360000 CODE 785.000
# 4 BAD 90000 NaN NaN
# 5 BAD 810000 NaN NaN
# 6 BAD 900000 NaN NaN
# 7 BAD 900000 NaN NaN
Edit:
For concatenating row-wise, you can use
df = pd.concat([df, pd.read_csv(f, sep=r'\s+', header=None, skiprows=(0, 1)[is_h])], axis=0, ignore_index=True)
# 0 1
# 0 BAD 90000
# 1 AAB 8550000
# 2 BAD 173688
# 3 BAD 360000
# 4 BAD 90000
# 5 BAD 810000
# 6 BAD 900000
# 7 BAD 900000
# 8 TEST 543
# 9 HELLO 555
# 10 STOCK 900
# 11 CODE 785
File2.txt does not have a header, right? But in _reader you set header to None.
Add a header to File2.txt and see what happens.
There are a couple of ways to check if a csv file has a header:
using the csv library
import csv

# open in text mode: in Python 3, Sniffer.has_header() expects a string
with open('example.csv', 'r') as csvfile:
    sniffer = csv.Sniffer()
    has_header = sniffer.has_header(csvfile.read(2048))
    csvfile.seek(0)
    # ...
my source
or if you know your data, checking if there are any digits in the first row
# csv_table here is assumed to be the file already parsed into a list of rows
is_header = not any(cell.isdigit() for cell in csv_table[0])
my source
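If you want that check wired into the actual read, a minimal sketch for files like the ones in the question might look like this (read_maybe_headered and File1.txt are just illustrative names):
import pandas as pd

def read_maybe_headered(path):
    # peek at the first line without letting pandas consume it
    with open(path) as fh:
        first_row = fh.readline().split()
    # treat the line as a header only if no cell is purely digits
    has_header = not any(cell.isdigit() for cell in first_row)
    return pd.read_csv(path, sep=r"\s+", header=0 if has_header else None)

df = read_maybe_headered("File1.txt")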
or with pandas itself, if you know what the header might be called
# cols is the list of expected column names; np is numpy
df = (pd.read_csv(filename, header=None, names=cols)
        [lambda x: np.ones(len(x)).astype(bool)
                   if (x.iloc[0] != cols).all()
                   else np.concatenate([[False], np.ones(len(x)-1).astype(bool)])]
      )
my source
and of course if you want to preprocess the files with the command line first, it'll probably be faster....
The input is many JSON files differing in structure, and the desired output is a single dataframe.
Input Description:
Each JSON file may have 1 or many attackers and exactly 1 victim. The attackers key points to a list of dictionaries. Each dictionary is 1 attacker with keys such as character_id, corporation_id, alliance_id, etc. The victim key points to a dictionary with similar keys. The important thing to note here is that the keys may differ even within the same JSON. For example, a JSON file may have an attackers key which looks like this:
{
    "attackers": [
        {
            "alliance_id": 99005678,
            "character_id": 94336577,
            "corporation_id": 98224639,
            "damage_done": 3141,
            "faction_id": 500003,
            "final_blow": true,
            "security_status": -9.4,
            "ship_type_id": 73796,
            "weapon_type_id": 3178
        },
        {
            "damage_done": 1614,
            "faction_id": 500003,
            "final_blow": false,
            "security_status": 0,
            "ship_type_id": 32963
        }
    ],
    ...
Here the JSON file has 2 attackers, but only the first attacker has all of the aforementioned keys. Similarly, the victim may look like this:
...
"victim": {
    "character_id": 2119076173,
    "corporation_id": 98725195,
    "damage_taken": 4755,
    "faction_id": 500002,
    "items": [...
...
Output Description:
As an output I want to create a dataframe from many (about 400,000) such JSON files stored in the same directory. Each row of the resulting dataframe should have 1 attacker and 1 victim. JSONs with multiple attackers should be split into an equal number of rows, where the attackers' properties differ but the victim properties are the same: for example, 3 rows if there are 3 attackers, with NaN values where a given attacker doesn't have a key-value pair. So the character_id for the second attacker in the dataframe of the above example should be NaN.
Current Method:
To achieve this, I first create an empty list. Then I iterate through all the files, open them, load them as JSON objects, convert each to a dataframe, and append the dataframe to the list. Please note that pd.DataFrame([json.load(fi)]) has the same output as pd.json_normalize(json.load(fi)).
mainframe = []
for file in tqdm(os.listdir("D:/Master/killmails_jul"), ncols=100, ascii=' >'):
    with open("%s/%s" % ("D:/Master/killmails_jul", file), 'r') as fi:
        mainframe.append(pd.DataFrame([json.load(fi)]))
After this loop, I am left with a list of dataframes which I concatenate using pd.concat().
mainframe = pd.concat(mainframe)
At this point, the dataframe only has 1 row per JSON irrespective of the number of attackers. To fix this, I use explode() in the next step.
mainframe = mainframe.explode('attackers')
mainframe.reset_index(drop=True, inplace=True)
Now I have separate rows for each attacker, however the attackers & victim keys are still nested in their respective columns. To fix this, I 'explode' the two columns horizontally with apply(pd.Series) and add a prefix for easy recognition, as follows:
intframe = mainframe["attackers"].apply(pd.Series).add_prefix("attackers_").join(mainframe["victim"].apply(pd.Series).add_prefix("victim_"))
In the next step I join this intermediate frame with the mainframe to retain the killmail_id and killmail_hash columns. Then remove the attackers & victim columns as I have now expanded them.
mainframe = intframe.join(mainframe)
mainframe.fillna(0, inplace=True)
mainframe.drop(['attackers','victim'], axis=1, inplace=True)
This gives me the desired output with the following 24 columns:
['attackers_character_id', 'attackers_corporation_id', 'attackers_damage_done', 'attackers_final_blow', 'attackers_security_status', 'attackers_ship_type_id', 'attackers_weapon_type_id', 'attackers_faction_id', 'attackers_alliance_id', 'victim_character_id', 'victim_corporation_id', 'victim_damage_taken', 'victim_items', 'victim_position', 'victim_ship_type_id', 'victim_alliance_id', 'victim_faction_id', 'killmail_id', 'killmail_time', 'solar_system_id', 'killmail_hash', 'http_last_modified', 'war_id', 'moon_id']
Question:
Is there a better way to do this than I am doing right now? I tried to use generators but couldn't get them to work. I get an AttributeError: 'str' object has no attribute 'read'
all_files_paths = glob(os.path.join('D:\\Master\\kmrest', '*.json'))

def gen_df(files):
    for file in files:
        with open(file, 'r'):
            data = json.load(file)
            data = pd.DataFrame([data])
            yield data

mainframe = pd.concat(gen_df(all_files_paths), ignore_index=True)
Will using the pd.concat() function with generators lead to quadratic copying?
Also, I am worried that opening and closing many files is slowing down computation. Maybe it would be better to create a JSONL file from all the JSONs first and then build the dataframe from that.
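For reference, a rough sketch of that JSONL route could look like the following - killmails.jsonl is just a placeholder name, and whether it is actually faster would need measuring:
import json
import os

import pandas as pd

src_dir = "D:/Master/killmails_jul"

# write every JSON object as one line of a single JSONL file
with open("killmails.jsonl", "w") as out:
    for name in os.listdir(src_dir):
        with open(os.path.join(src_dir, name)) as fh:
            out.write(json.dumps(json.load(fh)) + "\n")

# pandas can then parse the whole file in one call
mainframe = pd.read_json("killmails.jsonl", lines=True)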
If you'd like to get your hands on the files I am trying to work with, you can click here. Let me know if further information is needed.
You could use pd.json_normalize() to help with the heavy lifting:
First, load your data:
import io
import json
import requests
import tarfile
from tqdm.notebook import tqdm

url = 'https://data.everef.net/killmails/2022/killmails-2022-11-22.tar.bz2'
with requests.get(url, stream=True) as r:
    fobj = io.BytesIO(r.raw.read())

with tarfile.open(fileobj=fobj, mode='r:bz2') as tar:
    json_files = [it for it in tar if it.name.endswith('.json')]
    data = [json.load(tar.extractfile(it)) for it in tqdm(json_files)]
To do the same with your files:
import json
from glob import glob
def json_load(filename):
    with open(filename) as f:
        return json.load(f)

topdir = '...'  # the dir containing all your json files
data = [json_load(fn) for fn in tqdm(glob(f'{topdir}/*.json'))]
Once you have a list of dicts in data:
others = ['killmail_id', 'killmail_hash']
a = pd.json_normalize(data, 'attackers', others, record_prefix='attackers.')
v = pd.json_normalize(data).drop('attackers', axis=1)
df = a.merge(v, on=others)
Some quick inspection:
>>> df.shape
(44903, 26)
# check:
>>> sum([len(d['attackers']) for d in data])
44903
>>> df.columns
Index(['attackers.alliance_id', 'attackers.character_id',
'attackers.corporation_id', 'attackers.damage_done',
'attackers.final_blow', 'attackers.security_status',
'attackers.ship_type_id', 'attackers.weapon_type_id',
'attackers.faction_id', 'killmail_id', 'killmail_hash', 'killmail_time',
'solar_system_id', 'http_last_modified', 'victim.alliance_id',
'victim.character_id', 'victim.corporation_id', 'victim.damage_taken',
'victim.items', 'victim.position.x', 'victim.position.y',
'victim.position.z', 'victim.ship_type_id', 'victim.faction_id',
'war_id', 'moon_id'],
dtype='object')
>>> df.iloc[:5, :5]
attackers.alliance_id attackers.character_id attackers.corporation_id attackers.damage_done attackers.final_blow
0 99007887.0 1.450608e+09 2.932806e+08 1426 False
1 99010931.0 1.628193e+09 5.668252e+08 1053 False
2 99007887.0 1.841341e+09 1.552312e+09 1048 False
3 99007887.0 2.118406e+09 9.872458e+07 662 False
4 99005839.0 9.573650e+07 9.947834e+08 630 False
>>> df.iloc[-5:, -5:]
victim.position.z victim.ship_type_id victim.faction_id war_id moon_id
44898 1.558110e+11 670 NaN NaN NaN
44899 -7.678686e+10 670 NaN NaN NaN
44900 -7.678686e+10 670 NaN NaN NaN
44901 -7.678686e+10 670 NaN NaN NaN
44902 -7.678686e+10 670 NaN NaN NaN
Note also that, as desired, missing keys for attackers are NaN:
>>> df.iloc[15:20, :2]
attackers.alliance_id attackers.character_id
15 99007887.0 2.117497e+09
16 99011893.0 1.593514e+09
17 NaN 9.175132e+07
18 NaN 2.119191e+09
19 99011258.0 1.258332e+09
I have a .dat file which looks something like the below....
#| step | Channel| Mode | Duration|Freq.| Amplitude | Phase|
0 1 AWG Pi/2 100 2 1
1 1 SIN^2 100 1 1
2 1 SIN^2 200 0.5 1
3 1 REC 50 100 1 1
100 0 REC Pi/2 150 1 1
I created a data frame and wanted to extract data from it, but I get the following error:
TypeError: expected str, bytes or os.PathLike object, not DataFrame
My code is below here,
import pandas as pd
import numpy as np
path = "updated.dat"
datContent = [i.strip().split() for i in open(path).readlines()]
#print(datContent)
column_names = datContent.pop(0)
print(column_names)
df = pd.DataFrame(datContent)
print(df)
extract_column = df.iloc[:,2]
with open(df, 'r') as openfile:
    for line in openfile:
        for column_search in line:
            column_search = df.iloc[:, 2]
            if "REC" in column_search:
                print("Rec found")
Any suggestions would be appreciated.
Since your post does not contain a clear question, I have to guess based on your code. I am assuming that what you want is to find all rows in the DataFrame where the column Mode contains the value REC.
Based on that, I prepared a small, self-contained example that works on your data.
In your situation, the only line that you should use is the last one. Assuming that your DataFrame is created and filled correctly, your code below print(df) can be replaced by this single line.
I would really recommend reading the official documentation on indexing and selecting data in DataFrames. https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
import pandas as pd
from io import StringIO
data = StringIO("""
no;step;Channel;Mode;Duration;Freq.;Amplitude;Phase
;0;1;AWG;Pi/2;100;2;1
;1;1;SIN^2;;100;1;1
;2;1;SIN^2;;200;0.5;1
;3;1;REC;50;100;1;1
;100;0;REC;Pi/2;150;1;1
""")
df = pd.read_csv(data, sep=";")
df.loc[df.loc[:, 'Mode'] == "REC", :]
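And if all you wanted was the "Rec found" message from your original loop, a boolean check on the same column does it:
# check whether any row has Mode equal to "REC" (uses the df built above)
if (df["Mode"] == "REC").any():
    print("Rec found")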
I have a series of very messy *.csv files that are being read in by pandas. An example csv is:
Instrument 35392
"Log File Name : station"
"Setup Date (MMDDYY) : 031114"
"Setup Time (HHMMSS) : 073648"
"Starting Date (MMDDYY) : 031114"
"Starting Time (HHMMSS) : 090000"
"Stopping Date (MMDDYY) : 031115"
"Stopping Time (HHMMSS) : 235959"
"Interval (HHMMSS) : 010000"
"Sensor warmup (HHMMSS) : 000200"
"Circltr warmup (HHMMSS) : 000200"
"Date","Time","","Temp","","SpCond","","Sal","","IBatt",""
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","Volts",""
"Random message here 031114 073721 to 031114 083200"
03/11/14,09:00:00,"",15.85,"",1.408,"",.74,"",6.2,""
03/11/14,10:00:00,"",15.99,"",1.96,"",1.05,"",6.3,""
03/11/14,11:00:00,"",14.2,"",40.8,"",26.12,"",6.2,""
03/11/14,12:00:01,"",14.2,"",41.7,"",26.77,"",6.2,""
03/11/14,13:00:00,"",14.5,"",41.3,"",26.52,"",6.2,""
03/11/14,14:00:00,"",14.96,"",41,"",26.29,"",6.2,""
"message 3"
"message 4"**
I have been using this code to import the *.csv file, process the double headers, pull out the empty columns, and then strip the offending rows with bad data:
DF = pd.read_csv(BADFILE, parse_dates={'Datetime_(ascii)': [0, 1]}, sep=",",
                 header=[10, 11], na_values=['', 'na', 'nan nan'],
                 skiprows=[10], encoding='cp1252')
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2)
droplist = ['message', 'Random']
DF = DF[~DF['Datetime_(ascii)'].str.contains('|'.join(droplist))]
DF.head()
Datetime_(ascii) (Temp, øC) (SpCond, mS/cm) (Sal, ppt) (IBatt, Volts)
0 03/11/14 09:00:00 15.85 1.408 0.74 6.2
1 03/11/14 10:00:00 15.99 1.960 1.05 6.3
2 03/11/14 11:00:00 14.20 40.800 26.12 6.2
3 03/11/14 12:00:01 14.20 41.700 26.77 6.2
4 03/11/14 13:00:00 14.50 41.300 26.52 6.2
This was working fine and dandy until I hit a file that has an erroneous one-column row after the header: "Random message here 031114 073721 to 031114 083200"
The error I receive is:
C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _do_date_conversions(self, names, data)
   1554         data, names = _process_date_conversion(
   1555             data, self._date_conv, self.parse_dates, self.index_col,
-> 1556             self.index_names, names, keep_date_col=self.keep_date_col)
   1557
   1558         return names, data

C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
   2975     if not keep_date_col:
   2976         for c in list(date_cols):
-> 2977             data_dict.pop(c)
   2978             new_cols.remove(c)
   2979

KeyError: ('Time', 'HHMMSS')
If I remove that line, the code works fine. Similarly, if I remove the header= argument, the code works fine. However, I want to be able to preserve the header handling because I am reading in hundreds of these files.
Difficulty: I would prefer not to open each file before the call to pandas.read_csv(), as these files can be rather large - thus I don't want to read and save multiple times! Also, I would prefer a real pandas/pythonic solution that doesn't involve opening the file first as a StringIO buffer to remove offending lines.
Here's one approach, making use of the fact that skiprows accepts a callable function. The function receives only the row index being considered, which is a built-in limitation of that parameter.
As such, the callable function skip_test() first checks whether the current index is in the set of known indices to skip. If not, then it opens the actual file and checks the corresponding row to see if its contents match.
The skip_test() function is a little hacky in the sense that it does inspect the actual file, although it only inspects up until the current row index it's evaluating. It also assumes that the bad line always begins with the same string (in the example case, "foo"), but that seems to be a safe assumption given the OP's description.
# example data
""" foo.csv
uid,a,b,c
0,1,2,3
skip me
1,11,22,33
foo
2,111,222,333
"""

import pandas as pd

def skip_test(r, fn, fail_on, known):
    if r in known:  # we know we always want to skip these
        return True
    # check if row index matches problem line in file
    # for efficiency, quit after we pass row index in file
    with open(fn, "r") as f:
        data = f.read()
    for i, line in enumerate(data.splitlines()):
        if (i == r) & line.startswith(fail_on):
            return True
        elif i > r:
            break
    return False

fname = "foo.csv"
fail_str = "foo"
known_skip = [2]
pd.read_csv(fname, sep=",", header=0,
            skiprows=lambda x: skip_test(x, fname, fail_str, known_skip))
# output
uid a b c
0 0 1 2 3
1 1 11 22 33
2 2 111 222 333
If you know exactly which line the random message will appear on when it does appear, then this will be much faster, as you can just tell it not to inspect the file contents for any index past the potential offending line.
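For instance, if the stray message could only ever appear within, say, the first dozen rows, a variant along these lines (MAX_BAD_ROW is a made-up bound) avoids re-reading the file for every later row:
MAX_BAD_ROW = 12  # hypothetical upper bound on where the message can appear

def skip_test_fast(r, fn, fail_on, known):
    if r in known:              # always skip the known offenders
        return True
    if r > MAX_BAD_ROW:         # past any possible offender: no need to open the file
        return False
    return skip_test(r, fn, fail_on, known)  # fall back to the checker above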
After some tinkering yesterday I found a solution and what the potential issue may be.
I tried the skip_test() function answer above, but I was still getting errors with the size of the table:
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()
ParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 11
So after playing around with skiprows= I discovered that I was just not getting the behavior I wanted with engine='c': read_csv() was still determining the shape of the file from those first few rows, and some of those single-column rows were still being passed through. It may be that I have a few more bad single-column rows in my csv set that I did not plan on.
Instead, I create an arbitrarily sized DataFrame as a template. I pull in the entire .csv file, then use logic to strip out the NaN rows and columns.
For example, I know that the widest table I will encounter in my data will be 10 columns. So my call to pandas is:
DF = pd.read_csv(csv_file, sep=',',
                 parse_dates={'Datetime_(ascii)': [0, 1]},
                 na_values=['', 'na', '999999', '#'], engine='c',
                 encoding='cp1252', names=list(range(0, 10)))
I then use these two lines to drop the NaN rows and columns from the DataFrame:
#drop the null columns created by double deliminators
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2) # drop if we don't have at least 2 cells with real values
If anyone comes across this question in the future: pandas has since implemented the on_bad_lines argument, so you can now solve this problem by using on_bad_lines="skip".
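A minimal sketch of that call (on_bad_lines was added in pandas 1.3; "messy.csv" is a placeholder, and your other read_csv options stay as before):
import pandas as pd

DF = pd.read_csv("messy.csv", sep=",", encoding="cp1252",
                 on_bad_lines="skip")  # silently drop rows pandas cannot parse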
I have the following csv file that I process as follows:
import pandas as pd
df = pd.read_csv('file.csv', sep=',',header=None)
id ocr raw_value
00037625-4706-4dfe-a7b3-de8c47e3a28d A 3
000a7b30-4c4f-4756-a757-f688ccc55d5d A /c
000b08e3-4129-4fd2-8ec0-23d00fe38a45 A yes
00196436-12bc-4024-b623-25bac586d314 A know
001b8c43-3e73-43c1-ba4f-df5edb10dfac A hi
002882ca-48bb-4161-a75a-cf0ec984d650 A fd
003b2890-3727-4c79-955a-f74ec6945ed7 A Sensible
004d9025-86f0-4f8c-9720-01e3385c5e77 A 2015
Now I want to add a new column:
df['val'] = None
for img in images:
    id, ext = img.rsplit('.', 1)
    idx = df[df[0] == id].index.values
    df.loc[df.index[idx], 'val'] = id
When I write df to a new file as follows:
df.to_csv('new_file.csv', sep=',',encoding='utf-8')
I noticed that the column is correctly added and filled, but the column remains without a name even though it's supposed to be named val:
id ocr raw_value
00037625-4706-4dfe-a7b3-de8c47e3a28d A 3 4
000a7b30-4c4f-4756-a757-f688ccc55d5d A /c 3
000b08e3-4129-4fd2-8ec0-23d00fe38a45 A yes 1
00196436-12bc-4024-b623-25bac586d314 A know 8
001b8c43-3e73-43c1-ba4f-df5edb10dfac A hi 9
002882ca-48bb-4161-a75a-cf0ec984d650 A fd 10
003b2890-3727-4c79-955a-f74ec6945ed7 A Sensible 14
How do I set a name for the last column added?
EDIT1:
print(df.head())
0 1 2 3
0 id ocr raw_value manual_raw_value
1 00037625-4706-4dfe-a7b3-de8c47e3a28d ABBYY 03 03
2 000a7b30-4c4f-4756-a757-f688ccc55d5d ABBYY y/c y/c
3 000b08e3-4129-4fd2-8ec0-23d00fe38a45 ABBYY armoire armoire
4 00196436-12bc-4024-b623-25bac586d314 ABBYY point point
val
0 None
1 93
2 yic
3 armoire
4 point
You only need read_csv, because sep=',' is the default and can be omitted, and header=None should only be used if the csv has no header:
df = pd.read_csv('file.csv')
The problem is that your first row was not parsed as column names, but as the first data row.
df = pd.read_csv('file.csv', sep=',', header=0, index_col=0)
should allow you to simplify the next portion to
df['val'] = None
for img in images:
    image_id, ext = img.rsplit('.', 1)
    df.loc[image_id, 'val'] = image_id
If you don't need the image_id as index afterwards, use df.reset_index(inplace=True)
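Putting those pieces together, a hedged sketch of the whole flow (images is the same list of "<id>.<ext>" filenames from the question):
import pandas as pd

df = pd.read_csv('file.csv', sep=',', header=0, index_col=0)

df['val'] = None
for img in images:
    image_id, ext = img.rsplit('.', 1)
    if image_id in df.index:    # only touch ids that actually exist in the csv
        df.loc[image_id, 'val'] = image_id

df.reset_index(inplace=True)    # optional: drop id from the index again
df.to_csv('new_file.csv', sep=',', encoding='utf-8')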
one easy way...
before to_csv:
df.columns.values[3] = "val"
I am trying to import a weirdly formatted text file into a pandas DataFrame. Two example lines are below:
LOADED LANE 1 MAT. TYPE= 2 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.474 LOADEFFECT 5075. LMAX= 3643. COV= .13
LOADED LANE 1 MAT. TYPE= 3 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.515 LOADEFFECT10009. LMAX= 9732. COV= .08
First I tried the following:
df = pd.read_csv('beta.txt', header=None, delim_whitespace=True, usecols=[2,5,7,9,11,13,15,17,19])
This seemed to work fine; however, it got messed up when it hit the second example line above, where there is no whitespace after the LOADEFFECT string (you may need to scroll a bit right to see it in the example). I got a result like:
632 1 2 1 200 10 3.474 5075. 3643. 0.13
633 1 3 1 200 10 3.515 LMAX= COV= NaN
Then I decided to use a regular expression to define my delimiters. After many trial and error runs (I am no expert in regex), I managed to get close with the following line:
df = pd.read_csv('beta.txt', header=None, sep=r'\s+|LOADED LANE|MAT. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV=', engine='python')
This almost works, but creates a NaN column for some reason at the very beginning:
632 NaN 1 2 1 200 10 3.474 5075 3643 0.13
633 NaN 1 3 1 200 10 3.515 10009 9732 0.08
At this point I think I can just delete that first column, and get away with it. However I wonder what would be the correct way to set up the regex to correctly parse this text file in one shot. Any ideas? Other than that, I am sure there is a smarter way to parse this text file. I would be glad to hear your recommendations.
Thanks!
import re
import pandas as pd
import csv

csvfile = open("parsing.txt")  # open text file
reader = csv.reader(csvfile)
new_list = []
for line in reader:
    for i in line:
        new_list.append(re.findall(r'(\d*\.\d+|\d+)', i))

table = pd.DataFrame(new_list)
table  # output will be pandas DataFrame with values