pandas read csv ignore ending semicolon of last column - python

My data file looks like this:
data.txt
user,activity,timestamp,x-axis,y-axis,z-axis
0,33,Jogging,49105962326000,-0.6946376999999999,12.680544,0.50395286;
1,33,Jogging,49106062271000,5.012288,11.264028,0.95342433;
2,33,Jogging,49106112167000,4.903325,10.882658000000001,-0.08172209;
3,33,Jogging,49106222305000,-0.61291564,18.496431,3.0237172;
As can be seen, the last column ends with a semicolon, so when I read the file into pandas, that column is inferred as type object (still ending with the semicolon).
df = pd.read_csv('data.txt')
df
user activity timestamp x-axis y-axis z-axis
0 33 Jogging 49105962326000 -0.694638 12.680544 0.50395286;
1 33 Jogging 49106062271000 5.012288 11.264028 0.95342433;
2 33 Jogging 49106112167000 4.903325 10.882658 -0.08172209;
3 33 Jogging 49106222305000 -0.612916 18.496431 3.0237172;
How do I make pandas ignore that semicolon?

The problem with your txt file is that it has mixed content: as far as I can see, the header line does not end with the semicolon terminator.
If you change the first line so that it also ends with a semicolon, it's quite simple:
pd.read_csv("data.txt", lineterminator=";")

This might not match your real file, but it works for the example given.
In the docs you can find the comment param, which:
indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in ‘a,b,c’ being treated as the header.
So if ; can only appear at the end of your last column:
>>> df = pd.read_csv("data.txt", comment=";")
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user 4 non-null int64
1 activity 4 non-null object
2 timestamp 4 non-null int64
3 x-axis 4 non-null float64
4 y-axis 4 non-null float64
5 z-axis 4 non-null float64
dtypes: float64(3), int64(2), object(1)
memory usage: 224.0+ bytes
>>> df
user activity timestamp x-axis y-axis z-axis
0 33 Jogging 49105962326000 -0.694638 12.680544 0.503953
1 33 Jogging 49106062271000 5.012288 11.264028 0.953424
2 33 Jogging 49106112167000 4.903325 10.882658 -0.081722
3 33 Jogging 49106222305000 -0.612916 18.496431 3.023717

You can make use of the converters param to parse the string, strip the trailing ;, and convert the value to float:
df = pd.read_csv('data.txt', sep=",", converters={"z-axis": lambda x: float(x.replace(";",""))})
print(df)
user activity timestamp x-axis y-axis z-axis
0 33 Jogging 49105962326000 -0.694638 12.680544 0.503953
1 33 Jogging 49106062271000 5.012288 11.264028 0.953424
2 33 Jogging 49106112167000 4.903325 10.882658 -0.081722
3 33 Jogging 49106222305000 -0.612916 18.496431 3.023717
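Alternatively, if you prefer to load the file as-is and clean up afterwards, here is a minimal post-read sketch (assuming the file name and column names from the question):
import pandas as pd

df = pd.read_csv('data.txt')
# strip the trailing ";" from the last column and restore a float dtype
df['z-axis'] = df['z-axis'].str.rstrip(';').astype(float)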

Related

Low size pandas Dataframe expand when saved in csv

I have a pandas dataframe which I managed to downsize to 8GB (dropping columns and converting many object columns with .astype('category')). This is the output of .info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 61435992 entries, 0 to 64771731
Data columns (total 33 columns):
# Column Dtype
--- ------ -----
0 reftype object
1 confscore int8
2 paperid int64
3 patent_string object
4 uspto int8
5 wherefound category
6 doi object
7 pmid float32
8 patent int64
9 knowngood_dummy int8
10 paperyear float32
11 papertitle object
12 magfieldid float32
13 oecd_field category
14 oecd_subfield category
15 wosfield category
16 author float32
17 entity_id float32
18 affiliation2 float32
19 class category
20 foaf_name object
21 type_entities object
22 acronym category
23 pos#lat float32
24 pos#long float32
25 city_name category
26 city_lat float32
27 city_lon float32
28 state_name category
29 postcode category
30 country_name category
31 country_alpha3 category
32 country_2 category
dtypes: category(12), float32(10), int64(2), int8(3), object(6)
memory usage: 7.6+ GB
I then used dask to export it in csv form in a single file as follows:
dask_db_emakg_prova.to_csv('df_emakg.csv', index=False, single_file=True) # this save to one file
However, when exported in csv form on my local computer its size increases to 22GB. How is this possible? Is there a way to export it at the size of the pandas df? I need to import it into Stata and R, which is why I would like a csv file, but at 22GB it takes a long time to open.
Thank you
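One common workaround (not from the original thread, just a hedged sketch): CSV is plain text, so the compact in-memory dtypes (category, float32, int8) are lost and every value is written out as characters, which is why the file grows. Writing a compressed CSV keeps the text format while shrinking the file on disk considerably; R can read gzipped CSVs directly (e.g. with readr), while Stata may need the file decompressed first. Assuming the DataFrame fits in memory as a pandas object:
# pandas infers gzip compression from the .gz extension;
# it can also be set explicitly with compression='gzip'
df.to_csv('df_emakg.csv.gz', index=False, compression='gzip')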

Reading flat file with combination of single- and multi-space delimiters

I have a space-delimited file in which the spaces are variable and I would like to know the ideal method for reading such files. I have tried Pandas and tried setting up many delimiters but nothing has worked so far.
Data format that I am currently working with:
STBID DOCUMENTNO DOCDATE CUSTID CT TOWNID PRDID PRD BATCHNO PRICE QUANTITY BONUS DISCOUNT AMOUNT NETAMOUNT REASON
642 752633 07-07-2021 0092 01 026 4419 OAD X-MEN TAB . 20S T-0987 1105.00 2 0 0.00 2210.00 2210.00 R
Data Format that I Need:
STBID,DOCUMENTNO,DOCDATE,CUSTID,CT,TOWNID,PRDID,PRD,BATCHNO,PRICE,QUANTITY,BONUS,DISCOUNT,AMOUNT,NETAMOUNT,REASON
642,752633,07-07-2021,0092,01,026,4419,OAD X-MEN TAB . 20S,T-0987,1105.00,2,0,0.00,2210.00,2210.00,R
You can try the approach below:
df = pd.read_csv("texttest", header=None)
print(df)
0
0 STBID DOCUMENTNO DOCDATE CUSTID CT TOWNID PRDID PRD BATCHNO PRICE QUANTITY BONUS DISCOUNT AMOUNT NETAMOUNT REASON
1 642 752633 07-07-2021 0092 01 026 4419 OAD X-MEN TAB . 20S T-0987 1105.00 2 0 0.00 2210.00 2210.00 R
Now use replace to convert the spaces into commas:
df = df.replace(r'\s+', ',', regex=True)
print(df)
0
0 STBID,DOCUMENTNO,DOCDATE,CUSTID,CT,TOWNID,PRDID,PRD,BATCHNO,PRICE,QUANTITY,BONUS,DISCOUNT,AMOUNT,NETAMOUNT,REASON
1 642,752633,07-07-2021,0092,01,026,4419,OAD,X-MEN,TAB,.,20S,T-0987,1105.00,2,0,0.00,2210.00,2210.00,R
Finally, save to a file without index and header:
df.to_csv('new_csv2',index=False,header=False)
$ cat new_csv2
"STBID,DOCUMENTNO,DOCDATE,CUSTID,CT,TOWNID,PRDID,PRD,BATCHNO,PRICE,QUANTITY,BONUS,DISCOUNT,AMOUNT,NETAMOUNT,REASON"
"642,752633,07-07-2021,0092,01,026,4419,OAD,X-MEN,TAB,.,20S,T-0987,1105.00,2,0,0.00,2210.00,2210.00,R"
Edit:
As the data is not consistent, for this single occurrence you can apply the workaround below, but it is not dynamic; it is better to clean the data during processing.
df = df.replace(r"OAD,X-MEN,TAB,\.,20S", "OAD X-MEN TAB . 20S", regex=True)
print(df)
0
0 STBID,DOCUMENTNO,DOCDATE,CUSTID,CT,TOWNID,PRDID,PRD,BATCHNO,PRICE,QUANTITY,BONUS,DISCOUNT,AMOUNT,NETAMOUNT,REASON
1 642,752633,07-07-2021,0092,01,026,4419,OAD X-MEN TAB . 20S,T-0987,1105.00,2,0,0.00,2210.00,2210.00,R
Here is a working solution that treats the data as a fixed-width file, using pandas.read_fwf:
import re
import numpy as np  # not strictly required
import pandas as pd
# read header
with open('multi_space.csv', 'r') as f:
    header = f.readline()
# get starting positions for each word in the header
starts = [m.start() for m in re.finditer(r'\w+', header)]
# define colspecs (start, stop) for each column
cols = list(zip(starts, np.array(starts[1:] + [len(header)]) - 1))
## below alternative without numpy
# cols = list(zip(starts, [s - 1 for s in starts[1:] + [len(header)]]))
# read fixed width
df = pd.read_fwf('multi_space.csv', colspecs=cols)
output:
STBID DOCUMENTNO DOCDATE CUSTID CT TOWNID PRDID PRD BATCHNO PRICE QUANTITY BONUS DISCOUNT AMOUNT NETAMOUNT REASON
0 642 752633 07-07-202 92 0 26 4419 OAD X-MEN TAB . 20S T-0987 1105.0 2 0 0.0 2210.0 2210.0 R
info:
>>> df.info()
RangeIndex: 1 entries, 0 to 0
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 STBID 1 non-null int64
1 DOCUMENTNO 1 non-null int64
2 DOCDATE 1 non-null object
3 CUSTID 1 non-null int64
4 CT 1 non-null int64
5 TOWNID 1 non-null int64
6 PRDID 1 non-null int64
7 PRD 1 non-null object
8 BATCHNO 1 non-null object
9 PRICE 1 non-null float64
10 QUANTITY 1 non-null int64
11 BONUS 1 non-null int64
12 DISCOUNT 1 non-null float64
13 AMOUNT 1 non-null float64
14 NETAMOUNT 1 non-null float64
15 REASON 1 non-null object
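As a side note (not part of the original answer): read_fwf can also infer the column boundaries itself via its default colspecs='infer', which may be enough when the real file is aligned in a fixed-width fashion. A hedged sketch:
# let pandas infer the fixed-width column specs from the first rows;
# whether this works depends on how the actual file is aligned
df = pd.read_fwf('multi_space.csv', colspecs='infer', infer_nrows=100)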

Group by of dataframe with average of a column

I am really new to Python; I started learning it just a week ago. I have a query and hope you guys can help me solve it. Thanks in advance!
I have data in below format.
Date Product Price Discount
1/1/2020 A 17,490 30
1/1/2020 B 34,990 21
1/1/2020 C 20,734 11
1/2/2020 A 16,884 26
1/2/2020 B 26,990 40
1/2/2020 C 17,936 10
1/3/2020 A 16,670 36
1/3/2020 B 12,990 13
1/3/2020 C 30,990 43
I want to take the average of the Discount column for each date and have just 2 columns, but it isn't working out:
Date AVG_Discount
1/1/2020 x %
1/2/2020 y %
1/3/2020 z %
What I have tried is below. As I said, I am a novice in Python, so my approach might be incorrect. I need some guidance. TIA
mean_col=df.groupby(df['time'])['discount'].mean()
df=df.set_index(['time'])
df['mean_col']=mean_col
df=df.reset_index()
df.groupby(df['time'])['discount'].mean() already returns a Series with time as the index.
All you need to do is call reset_index on the result:
grouped_df = df.groupby(df['time'])['discount'].mean().reset_index()
As Quang Hoang suggested in the comments, you can also pass as_index=False to groupby.
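For example, a minimal sketch using the Date and Discount column names from the sample data in the question (rather than time/discount):
# group by Date, averaging Discount, and keep Date as a regular column
grouped_df = df.groupby('Date', as_index=False)['Discount'].mean()
grouped_df = grouped_df.rename(columns={'Discount': 'AVG_Discount'})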
Apparently, you read your DataFrame from a text file,
e.g. a CSV, but with a separator other than a comma.
Run df.info() and I assume you get a result something like below:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
Date 9 non-null object
Product 9 non-null object
Price 9 non-null object
Discount 9 non-null int64
dtypes: int64(1), object(3)
Note that the Date, Product and Price columns are of object type
(actually, strings). This remark is especially important in the case of the
Price column, because to compute a mean the source column must be
a number (not a string).
So first you should convert the Date and Price columns to proper types
(datetime and float). To do it, run:
df.Date = pd.to_datetime(df.Date)
df.Price = df.Price.str.replace(',', '.').astype(float)
Run df.info() again and now the result should be:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
Date 9 non-null datetime64[ns]
Product 9 non-null object
Price 9 non-null float64
Discount 9 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
And now you can compute the mean discount, running:
df.groupby('Date').Discount.mean()
For your data I got:
Date
2020-01-01 20.666667
2020-01-02 25.333333
2020-01-03 30.666667
Name: Discount, dtype: float64
Note that your code sample contains the following errors:
The argument of groupby is a column name (or a list of column names), so:
df inside the parentheses is not needed,
instead of time you should write Date (you have no time column).
Also note that your Discount column starts with a capital D, not lowercase discount.

I lose my values in the columns

I've organized my data using pandas, and my procedure is shown below:
import pandas as pd
import numpy as np
df1 = pd.read_table(r'E:\빅데이터 캠퍼스\골목상권 프로파일링 - 서울 열린데이터 광장 3.초기-16년5월분1\17.상권-추정매출\201301-201605\tbsm_trdar_selng.txt\tbsm_trdar_selng_utf8.txt',
                    sep='|', header=None, dtype={'0': pd.np.int})
df1 = df1.replace('201301', int(201301))
df2 = df1[[0 ,1, 2, 3 ,4, 11,12 ,82 ]]
df2_rename = df2.columns = ['STDR_YM_CD', 'TRDAR_CD', 'TRDAR_CD_NM', 'SVC_INDUTY_CD', 'SVC_INDUTY_CD_NM', 'THSMON_SELNG_AMT', 'THSMON_SELNG_CO', 'STOR_CO' ]
print(df2.head(40))
df3_groupby = df2.groupby(['STDR_YM_CD', 'TRDAR_CD' ])
df4_agg = df3_groupby.agg(np.sum)
print(df4_agg.head(30))
When I print df2, I can see the 11947 and 11948 values in my TRDAR_CD column, as in the picture below.
After that, I used the groupby function and I lose the 11948 values in my TRDAR_CD column. You can see this situation in the picture below.
Could this problem come from the warning message? The warning message is 'sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.'
Please help.
The output of print(df2.info()) is:
RangeIndex: 1089023 entries, 0 to 1089022
Data columns (total 8 columns):
STDR_YM_CD 1089023 non-null object
TRDAR_CD 1089023 non-null int64
TRDAR_CD_NM 1085428 non-null object
SVC_INDUTY_CD 1089023 non-null object
SVC_INDUTY_CD_NM 1089023 non-null object
THSMON_SELNG_AMT 1089023 non-null int64
THSMON_SELNG_CO 1089023 non-null int64
STOR_CO 1089023 non-null int64
dtypes: int64(4), object(4)
memory usage: 66.5+ MB
None
The MultiIndex is built from the first and second columns, and if the first level contains duplicate values, pandas by default 'sparsifies' the higher levels of the index to make the console output a bit easier on the eyes.
You can show the data in the first level of the MultiIndex by setting display.multi_sparse to False.
Sample:
df = pd.DataFrame({'A': [1, 1, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
df.set_index(['A', 'B'], inplace=True)
print(df)
C
A B
1 4 7
5 8
3 6 9
# temporarily set multi_sparse to False
# http://pandas.pydata.org/pandas-docs/stable/options.html#getting-and-setting-options
with pd.option_context('display.multi_sparse', False):
    print(df)
C
A B
1 4 7
1 5 8
3 6 9
EDIT (after the question was edited):
I think the problem is that the value 11948 is stored as a string, so it is omitted.
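A quick, hedged way to check whether the column really contains a mix of strings and integers (this diagnostic is not from the original answer) is:
# count the Python types present in the column; a mixed int/str column
# shows more than one entry here
print(df2['TRDAR_CD'].map(type).value_counts())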
EDIT1 (after seeing the file):
You can simplify your solution by adding the usecols parameter in read_csv and then aggregating with GroupBy.sum:
import pandas as pd
import numpy as np
df2 = pd.read_table(r'tbsm_trdar_selng_utf8.txt',
                    sep='|',
                    header=None,
                    usecols=[0, 1, 2, 3, 4, 11, 12, 82],
                    names=['STDR_YM_CD', 'TRDAR_CD', 'TRDAR_CD_NM', 'SVC_INDUTY_CD', 'SVC_INDUTY_CD_NM', 'THSMON_SELNG_AMT', 'THSMON_SELNG_CO', 'STOR_CO'],
                    dtype={'STDR_YM_CD': int})
df4_agg = df2.groupby(['STDR_YM_CD', 'TRDAR_CD' ]).sum()
print(df4_agg.head(10))
THSMON_SELNG_AMT THSMON_SELNG_CO STOR_CO
STDR_YM_CD TRDAR_CD
201301 11947 1966588856 74798 73
11948 3404215104 89064 116
11949 1078973946 42005 45
11950 1759827974 93245 71
11953 779024380 21042 84
11954 2367130386 94033 128
11956 511840921 23340 33
11957 329738651 15531 50
11958 1255880439 42774 118
11962 1837895919 66692 68

Reading in data as float via a converter

I have a csv file called 'filename' and want to read in the data as float64, except for the column 'hour'. I managed it with the pd.read_csv function and a converter.
df = pd.read_csv("../data/filename.csv",
delimiter = ';',
date_parser = ['hour'],
skiprows = 1,
converters={'column1': lambda x: float(x.replace ('.','').replace(',','.'))})
Now, I have two points:
FIRST:
The delimiter works with ';', but if I look at my data in Notepad, there are ',' characters, not ';'. If I use ',' instead, I get: 'pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 13, saw 9'
SECOND:
If I want to use the converter for all columns, how can I do that? What's the right term?
I tried using dtype = float in the read_csv call, but I get 'AttributeError: 'NoneType' object has no attribute 'dtype''. What happened? That's the reason why I want to manage it with the converter.
Data:
,hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"
This should work:
In [40]:
# text data
temp=''',hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"'''
# read the csv from the string, passing the quotechar and thousands params
import io
df = pd.read_csv(io.StringIO(temp), quotechar='"', thousands=',')
df
Out[40]:
Unnamed: 0 hour PV Wind onshore Wind offshore PV.1 Wind onshore.1 \
0 0 1 0 12985.0 9614.0 0 32825.5
1 1 2 0 12908.9 9290.8 0 36052.3
2 2 3 0 12740.9 8886.9 0 38540.9
3 3 4 0 12485.3 8644.5 0 40734.0
4 4 5 0 11188.5 8079.0 0 42688.0
5 5 6 0 11219.0 7594.2 0 43333.5
Wind offshore.1 PV.2 Wind onshore.2 Wind offshore.2
0 9495.7 0 13110.3 10855.5
1 9589.1 0 13670.2 10828.6
2 10087.3 0 14610.8 10828.6
3 10087.3 0 15638.3 10343.7
4 10087.3 0 16809.4 10343.7
5 10025.0 0 18266.9 10343.7
In [41]:
# check the dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 11 columns):
Unnamed: 0 6 non-null int64
hour 6 non-null int64
PV 6 non-null float64
Wind onshore 6 non-null float64
Wind offshore 6 non-null float64
PV.1 6 non-null float64
Wind onshore.1 6 non-null float64
Wind offshore.1 6 non-null float64
PV.2 6 non-null float64
Wind onshore.2 6 non-null float64
Wind offshore.2 6 non-null float64
dtypes: float64(9), int64(2)
memory usage: 576.0 bytes
So basically you need to pass the quotechar='"' and thousands=',' params to read_csv to achieve what you want, see the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
EDIT
If you want to convert after importing (which is a waste when you can do it upfront) then you can do this for each column of interest:
In [43]:
# replace the comma separator
df['Wind onshore'] = df['Wind onshore'].str.replace(',','')
# convert the type
df['Wind onshore'] = df['Wind onshore'].astype(np.float64)
df['Wind onshore'].dtype
Out[43]:
dtype('float64')
It would be faster to replace the comma separator on all the columns of interest first and just call convert_objects like so: df.convert_objects(convert_numeric=True)
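Note that convert_objects has since been deprecated and removed from pandas; pd.to_numeric is its replacement. A minimal sketch of the same post-import cleanup with current pandas, assuming the columns were read as plain strings (i.e. without thousands=','):
import pandas as pd

# strip the thousands separator, then convert to a numeric dtype
for col in ['Wind onshore', 'Wind offshore']:  # extend to the other columns as needed
    df[col] = pd.to_numeric(df[col].str.replace(',', '', regex=False))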
