Reading flat file with combination of single- and multi-space delimiters - python

I have a space-delimited file in which the spaces are variable and I would like to know the ideal method for reading such files. I have tried Pandas and tried setting up many delimiters but nothing has worked so far.
Data format that I am currently working with:
STBID DOCUMENTNO DOCDATE CUSTID CT TOWNID PRDID PRD BATCHNO PRICE QUANTITY BONUS DISCOUNT AMOUNT NETAMOUNT REASON
642 752633 07-07-2021 0092 01 026 4419 OAD X-MEN TAB . 20S T-0987 1105.00 2 0 0.00 2210.00 2210.00 R
Data Format that I Need:
STBID,DOCUMENTNO,DOCDATE,CUSTID,CT,TOWNID,PRDID,PRD,BATCHNO,PRICE,QUANTITY,BONUS,DISCOUNT,AMOUNT,NETAMOUNT,REASON
642,752633,07-07-2021,0092,01,026,4419,OAD X-MEN TAB . 20S,T-0987,1105.00,2,0,0.00,2210.00,2210.00,R

you can try below ..
df = pd.read_csv("texttest", header=None)
print(df)
0
0 STBID DOCUMENTNO DOCDATE CUSTID CT TOWNID PRDID PRD BATCHNO PRICE QUANTITY BONUS DISCOUNT AMOUNT NETAMOUNT REASON
1 642 752633 07-07-2021 0092 01 026 4419 OAD X-MEN TAB . 20S T-0987 1105.00 2 0 0.00 2210.00 2210.00 R
Now use replace to convert spaces into comma..
df = df.replace(r'\s+', ',', regex=True)
print(df)
0
0 STBID,DOCUMENTNO,DOCDATE,CUSTID,CT,TOWNID,PRDID,PRD,BATCHNO,PRICE,QUANTITY,BONUS,DISCOUNT,AMOUNT,NETAMOUNT,REASON
1 642,752633,07-07-2021,0092,01,026,4419,OAD,X-MEN,TAB,.,20S,T-0987,1105.00,2,0,0.00,2210.00,2210.00,R
At last, save to a file without index and header.
df.to_csv('new_csv2',index=False,header=False)
$ cat new_csv1
"STBID,DOCUMENTNO,DOCDATE,CUSTID,CT,TOWNID,PRDID,PRD,BATCHNO,PRICE,QUANTITY,BONUS,DISCOUNT,AMOUNT,NETAMOUNT,REASON"
"642,752633,07-07-2021,0092,01,026,4419,OAD,X-MEN,TAB,.,20S,T-0987,1105.00,2,0,0.00,2210.00,2210.00,R"
Edit:
As data is not consistent, so for single occurrence you can do as a workaround, but this is not Dynamic, better to clean data at the processing.
df = df.replace(r"X-MEN,TAB,.,20S", "OAD X-MEN TAB . 20S", regex=True)
print(df)
0
0 STBID,DOCUMENTNO,DOCDATE,CUSTID,CT,TOWNID,PRDID,PRD,BATCHNO,PRICE,QUANTITY,BONUS,DISCOUNT,AMOUNT,NETAMOUNT,REASON
1 642,752633,07-07-2021,0092,01,026,4419,OAD,OAD X-MEN TAB . 20S,T-0987,1105.00,2,0,0.00,2210.00,2210.00,R

Here is a working solution considering the data as fixed width file, using pandas.read_fwf:
import re
import numpy as np # not strictly required
# read header
with open('multi_space.csv', 'r') as f:
header = f.readline()
# get starting positions for each word in the header
starts = [m.start() for m in re.finditer('\w+', header)]
# define colspecs (start,stop) for each column
cols = list(zip(starts, np.array(starts[1:]+[len(head)])-1))
## below alternative without numpy
# cols = list(zip(starts, [s-1 for s in starts[1:]+[len(head)]]))
# read fixed width
df = pd.read_fwf('multi_space.csv', colspecs=cols)
output:
STBID DOCUMENTNO DOCDATE CUSTID CT TOWNID PRDID PRD BATCHNO PRICE QUANTITY BONUS DISCOUNT AMOUNT NETAMOUNT REASON
0 642 752633 07-07-202 92 0 26 4419 OAD X-MEN TAB . 20S T-0987 1105.0 2 0 0.0 2210.0 2210.0 R
infos:
>>> df.infos()
RangeIndex: 1 entries, 0 to 0
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 STBID 1 non-null int64
1 DOCUMENTNO 1 non-null int64
2 DOCDATE 1 non-null object
3 CUSTID 1 non-null int64
4 CT 1 non-null int64
5 TOWNID 1 non-null int64
6 PRDID 1 non-null int64
7 PRD 1 non-null object
8 BATCHNO 1 non-null object
9 PRICE 1 non-null float64
10 QUANTITY 1 non-null int64
11 BONUS 1 non-null int64
12 DISCOUNT 1 non-null float64
13 AMOUNT 1 non-null float64
14 NETAMOUNT 1 non-null float64
15 REASON 1 non-null object

Related

python pandas | replacing the date and time string with only time

price
quantity
high time
10.4
3
2021-11-08 14:26:00-05:00
dataframe = ddg
the datatype for hightime is datetime64[ns, America/New_York]
i want the high time to be only 14:26:00 (getting rid of 2021-11-08 and -05:00) but i got an error when using the code below
ddg['high_time'] = ddg['high_time'].dt.strftime('%H:%M')
I think because it's not the right column name:
# Your code
>>> ddg['high_time'].dt.strftime('%H:%M')
...
KeyError: 'high_time'
# With right column name
>>> ddg['high time'].dt.strftime('%H:%M')
0 14:26
Name: high time, dtype: object
# My dataframe:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 1 non-null float64
1 quantity 1 non-null int64
2 high time 1 non-null datetime64[ns, America/New_York]
dtypes: datetime64[ns, America/New_York](1), float64(1), int64(1)
memory usage: 152.0 bytes

pandas read csv ignore ending semicolon of last column

My data file looks like this:
data.txt
user,activity,timestamp,x-axis,y-axis,z-axis
0,33,Jogging,49105962326000,-0.6946376999999999,12.680544,0.50395286;
1,33,Jogging,49106062271000,5.012288,11.264028,0.95342433;
2,33,Jogging,49106112167000,4.903325,10.882658000000001,-0.08172209;
3,33,Jogging,49106222305000,-0.61291564,18.496431,3.0237172;
As can be seen, the last column ends with a semicolon, so when I read into pandas, the column is inferred as type object (ending with the semicolon.
df = pd.read_csv('data.txt')
df
user activity timestamp x-axis y-axis z-axis
0 33 Jogging 49105962326000 -0.694638 12.680544 0.50395286;
1 33 Jogging 49106062271000 5.012288 11.264028 0.95342433;
2 33 Jogging 49106112167000 4.903325 10.882658 -0.08172209;
3 33 Jogging 49106222305000 -0.612916 18.496431 3.0237172;
How do I make pandas ignore that semicolon?
The problem with your txt is that it has mixed content. As I can see the header doesn't have the semicolon as termination character
If you change the first line adding the semicolon it's quite simple
pd.read_csv("data.txt", lineterminator=";")
Might not be the case but it works given the example.
In the docs you could find comment param that:
indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in ‘a,b,c’ being treated as the header.
So if ; could only be found at the end of your last column:
>>> df = pd.read_csv("data.txt", comment=";")
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user 4 non-null int64
1 activity 4 non-null object
2 timestamp 4 non-null int64
3 x-axis 4 non-null float64
4 y-axis 4 non-null float64
5 z-axis 4 non-null float64
dtypes: float64(3), int64(2), object(1)
memory usage: 224.0+ bytes
>>> df
user activity timestamp x-axis y-axis z-axis
0 33 Jogging 49105962326000 -0.694638 12.680544 0.503953
1 33 Jogging 49106062271000 5.012288 11.264028 0.953424
2 33 Jogging 49106112167000 4.903325 10.882658 -0.081722
3 33 Jogging 49106222305000 -0.612916 18.496431 3.023717
You can make use of converters param:
to parse your string
to replace ;
to convert to float
df = pd.read_csv('data.txt', sep=",", converters={"z-axis": lambda x: float(x.replace(";",""))})
print(df)
data txtuser activity timestamp x-axis y-axis z-axis
0 0 33 Jogging 49105962326000 -0.694638 12.680544 0.503953
1 1 33 Jogging 49106062271000 5.012288 11.264028 0.953424
2 2 33 Jogging 49106112167000 4.903325 10.882658 -0.081722
3 3 33 Jogging 49106222305000 -0.612916 18.496431 3.023717

Group by of dataframe with average of a column

I am really new to python..just a week ago started learning it. I have a query and hope you guys can help me to solve it. Thanks in advance..!!
I have data in below format.
Date Product Price Discount
1/1/2020 A 17,490 30
1/1/2020 B 34,990 21
1/1/2020 C 20,734 11
1/2/2020 A 16,884 26
1/2/2020 B 26,990 40
1/2/2020 C 17,936 10
1/3/2020 A 16,670 36
1/3/2020 B 12,990 13
1/3/2020 C 30,990 43
I want to take the average of discount column for each date and just have 2 columns.. It aint working out.. :(
Date AVG_Discount
1/1/2020 x %
1/2/2020 y %
1/3/2020 z %
What I have tried doing is below.. As I said, I am novice in Python so approach might be incorrect.. Need guidance guys.. TIA
mean_col=df.groupby(df['time'])['discount'].mean()
df=df.set_index(['time'])
df['mean_col']=mean_col
df=df.reset_index()
df.groupby(df['time'])['discount'].mean() Is already returning series with time as index.
All you need to do is just use reset_index function on this.
grouped_df = df.groupby(df['time'])['discount'].mean().reset_index()
As Quang Hoang Suggested in comments. You can also pass as_index=False to groupby.
Apparently, you have read your DataFrame from a text file,
e.g. CSV, but with separator other than a comma.
Run df.info() and I assume that you got result something like below:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
Date 9 non-null object
Product 9 non-null object
Price 9 non-null object
Discount 9 non-null int64
dtypes: int64(1), object(3)
Note that Date, Product and Price columns are of object type
(actually, a string). This remark is especially importoant in case of
Price column, because to compte mean you should have source column
as a number (not a string).
So first you should convert Date and Price columns to proper types
(datetime and float). To do it run:
df.Date = pd.to_datetime(df.Date)
df.Price = df.Price.str.replace(',', '.').astype(float)
Run df.info() again and now the result should be:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
Date 9 non-null datetime64[ns]
Product 9 non-null object
Price 9 non-null float64
Discount 9 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
And now you can compute the mean discount, running:
df.groupby('Date').Discount.mean()
For your data I got:
Date
2020-01-01 20.666667
2020-01-02 25.333333
2020-01-03 30.666667
Name: Discount, dtype: float64
Note that your code sample contains the following errors:
Argument of groupby is the column name (or a list of column names), so:
df between parentheses is not needed,
instead of time you should write Date (you have no time column).
Your Discount column is written starting with capital D.

readin data as float per converter

I have a csv-file called 'filename' and want to read in these data as 64float, except the column 'hour'. I managed it with the pd.read_csv - function and an converter.
df = pd.read_csv("../data/filename.csv",
delimiter = ';',
date_parser = ['hour'],
skiprows = 1,
converters={'column1': lambda x: float(x.replace ('.','').replace(',','.'))})
Now, I have two points:
FIRST:
The delimiter works with ; ,but if I take a look in Notepad to my data, there are ',', not ';'. But if I take ',' I get: 'pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 13, saw 9'
SECOND:
If I want to use the converter for all columns, how can I get this?! What`s the right term?
I try to use dtype = float in the readin-function, but I get 'AttributeError: 'NoneType' object has no attribute 'dtype'' Whats happend? Thats the reasion why I want to managed it with the converter.
Data:
,hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind
offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"
This should work:
In [40]:
# text data
temp=''',hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"'''
# so read the csv, pass params quotechar and the thousands character
df = pd.read_csv(io.StringIO(temp), quotechar='"', thousands=',')
df
Out[40]:
Unnamed: 0 hour PV Wind onshore Wind offshore PV.1 Wind onshore.1 \
0 0 1 0 12985.0 9614.0 0 32825.5
1 1 2 0 12908.9 9290.8 0 36052.3
2 2 3 0 12740.9 8886.9 0 38540.9
3 3 4 0 12485.3 8644.5 0 40734.0
4 4 5 0 11188.5 8079.0 0 42688.0
5 5 6 0 11219.0 7594.2 0 43333.5
Wind offshore.1 PV.2 Wind onshore.2 Wind offshore.2
0 9495.7 0 13110.3 10855.5
1 9589.1 0 13670.2 10828.6
2 10087.3 0 14610.8 10828.6
3 10087.3 0 15638.3 10343.7
4 10087.3 0 16809.4 10343.7
5 10025.0 0 18266.9 10343.7
In [41]:
# check the dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 11 columns):
Unnamed: 0 6 non-null int64
hour 6 non-null int64
PV 6 non-null float64
Wind onshore 6 non-null float64
Wind offshore 6 non-null float64
PV.1 6 non-null float64
Wind onshore.1 6 non-null float64
Wind offshore.1 6 non-null float64
PV.2 6 non-null float64
Wind onshore.2 6 non-null float64
Wind offshore.2 6 non-null float64
dtypes: float64(9), int64(2)
memory usage: 576.0 bytes
So basically you need to pass the quotechar='"' and thousands=',' params to read_csv to achieve what you want, see the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
EDIT
If you want to convert after importing (which is a waste when you can do it upfront) then you can do this for each column of interest:
In [43]:
# replace the comma separator
df['Wind onshore'] = df['Wind onshore'].str.replace(',','')
# convert the type
df['Wind onshore'] = df['Wind onshore'].astype(np.float64)
df['Wind onshore'].dtype
Out[43]:
dtype('float64')
It would be faster to replace the comma separator on all the columns of interest first and just call convert_objects like so: df.convert_objects(convert_numeric=True)

pandas update a dataframe column based on prefiltered groupby object

given dataframe d such as this:
index col1
1 a
2 a
3 b
4 b
Create a prefiltered group object with new values:
g = d[prefilter].groupby(['some cols']).apply( somefunc )
index col1
2 c
4 d
Now I want to update df to this:
index col1
1 a
2 c
3 b
4 d
Ive been hacking away with update, ix, filtering, where, etc... I am guessing there is an obvious solution I am not seeing here.
stuff like this is not working:
d[d.index == db.index]['alert_v'] = db['alert_v']
q90 = g.transform( somefunc )
d.ix[ d['alert_v'] >=q90, 'alert_v'] = 1
d.ix[ d['alert_v'] < q90, 'alert_v'] = 0
d['alert_v'] = np.where( d.index==db.index, db['alert_v'], d['alert_v'] )
any help is appreciated
thankyou
--edit--
the two dataframes are in the same form:
one is simply a filtered version of the other, with different values, that I want to update to the original.
ValueError: cannot reindex from a duplicate axis
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 1984-12-12 13:33:00 to 1939-03-19 22:54:00
Data columns (total 9 columns):
source 2186 non-null object
subject_id 2186 non-null float64
alert_t 2186 non-null object
variable 2186 non-null object
timeindex 2186 non-null datetime64[ns]
alert_v 2105 non-null float64
value 2186 non-null float64
tavg 54 non-null timedelta64[ns]
iqt 61 non-null object
dtypes: datetime64[ns](1), float64(3), object(4), timedelta64[ns](1)None<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1982 entries, 1984-12-12 13:33:00 to 1939-03-19 22:54:00
Data columns (total 9 columns):
source 1982 non-null object
subject_id 1982 non-null float64
alert_t 1982 non-null object
variable 1982 non-null object
timeindex 1982 non-null datetime64[ns]
alert_v 1982 non-null int64
value 1982 non-null float64
tavg 0 non-null timedelta64[ns]
iqt 0 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4), timedelta64[ns](1)None
you want the df.update() function.
Try something like this:
import pandas as pd
df1 = pd.DataFrame({'Index':[1,2,3,4],'Col1':['A', 'B', 'C', 'D']}).set_index('Index')
df2 = pd.DataFrame({'Index':[2,4],'Col1':['E', 'F']}).set_index('Index')
print df1
Col1
Index
1 A
2 B
3 C
4 D
df1.update(df2)
print df1
Col1
Index
1 A
2 E
3 C
4 F

Categories

Resources