Small pandas DataFrame expands when saved to CSV - python

I have a pandas DataFrame which I managed to downsize to 8GB (dropping columns and converting many object columns with .astype('category')). This is the output of .info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 61435992 entries, 0 to 64771731
Data columns (total 33 columns):
# Column Dtype
--- ------ -----
0 reftype object
1 confscore int8
2 paperid int64
3 patent_string object
4 uspto int8
5 wherefound category
6 doi object
7 pmid float32
8 patent int64
9 knowngood_dummy int8
10 paperyear float32
11 papertitle object
12 magfieldid float32
13 oecd_field category
14 oecd_subfield category
15 wosfield category
16 author float32
17 entity_id float32
18 affiliation2 float32
19 class category
20 foaf_name object
21 type_entities object
22 acronym category
23 pos#lat float32
24 pos#long float32
25 city_name category
26 city_lat float32
27 city_lon float32
28 state_name category
29 postcode category
30 country_name category
31 country_alpha3 category
32 country_2 category
dtypes: category(12), float32(10), int64(2), int8(3), object(6)
memory usage: 7.6+ GB
I then used dask to export it to CSV as a single file, as follows:
dask_db_emakg_prova.to_csv('df_emakg.csv', index=False, single_file=True) # this saves to a single file
However, when exported to CSV on my local computer, its size increases to 22GB. How is this possible? Is there a way to export it at roughly the size of the pandas DataFrame? I need to import it into Stata and R, which is why I would like a CSV file, but at 22GB it takes a long time to open.
Thank you
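CSV is plain text: every value is re-serialized as characters plus delimiters, so the compact dtypes you worked for (category, int8, float32) lose their encoding advantage on disk, and a 7.6GB frame can easily become 22GB of text. If it must be CSV for Stata and R, compressing the output helps; a minimal sketch, assuming gzip is acceptable downstream (R's data.table::fread reads .csv.gz directly, while Stata would need the file decompressed first):
# same dask call as above, with a compressed target;
# the .gz file name is illustrative
dask_db_emakg_prova.to_csv('df_emakg.csv.gz',
                           index=False,
                           single_file=True,
                           compression='gzip')
Compression only shrinks the bytes on disk; the decompressed CSV is still ~22GB, so where CSV is not strictly required, a binary format such as Parquet (to_parquet, readable from R via the arrow package) is the smaller and faster option.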

Related

How to group sales data by month and year and plot the sales amount for each part in python?

Below is my current code. I can get everything to run except for the plot. Any suggestions would be greatly appreciated. The data being used is sales data over the course of 5 years. Column headings are:
Column Dtype
0 internal_fl object
1 invoicedate object
2 refkey3 object
3 refkey_desc object
4 ordernbr int64
5 orderlinenbr int64
6 ecwebordnbr object
7 ordertype object
8 order_source_group object
9 order_source_desc object
10 claim_no float64
11 resp object
12 cl_type object
13 credit_type object
14 acmodel object
15 acserial object
16 customer int64
17 cust_type_desc object
18 cust_type_group object
19 qty int64
20 salesamount_usd float64
21 item object
22 non_item float64
23 listprice float64
24 custprice float64
25 addl_disc_fl object
26 unitprice float64
27 disccode object
28 pricelist object
29 pct float64
30 alphacode object
31 purchmajcl object
32 policycode object
import pandas as pd
import csv
import os
df = pd.read_csv(r'/Users/me/Desktop/Part Sales.csv')
df_2 = pd.read_csv(r'/Users/me/Desktop/Part Sales2.csv')
concatenated = pd.concat([df, df_2]).drop_duplicates().reset_index(drop=True)
concatenated.head(10)
import matplotlib.pyplot as plt
import seaborn as sns
concatenated.shape
concatenated.columns
concatenated.info()
concatenated.isnull().sum()
concatenated.describe()
concatenated['invoicedate'].min()
concatenated['invoicedate'].max()
concatenated['month_year'] = concatenated['invoicedate'].apply(lambda x: '/'.join(x.split('/')[0:2]))
concatenated['month_year']
concatenated_trend = concatenated.groupby(['month_year'])['salesamount_usd'].sum() #reset_index()
concatenated_trend
plt.figure(figsize=(15,6))
plt.plot(concatenated_trend['month_year'], concatenated_trend['salesamount_usd'])
plt.xticks(rotation='vertical', size=8)
plt.show()
This is the error message displayed:
[screenshot of the error traceback]
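The groupby/sum returns a Series indexed by month_year, so concatenated_trend['month_year'] raises a KeyError when plotting. Resetting the index (as the commented-out reset_index hints) turns both back into columns; a minimal sketch, assuming your column names:
# parse the dates so months sort chronologically, not as strings;
# this also avoids the month/day mix-up: with m/d/yyyy dates,
# '/'.join(x.split('/')[0:2]) yields month/day, not month/year
concatenated['invoicedate'] = pd.to_datetime(concatenated['invoicedate'])
concatenated['month_year'] = concatenated['invoicedate'].dt.to_period('M').astype(str)
concatenated_trend = concatenated.groupby('month_year', as_index=False)['salesamount_usd'].sum()
plt.figure(figsize=(15,6))
plt.plot(concatenated_trend['month_year'], concatenated_trend['salesamount_usd'])
plt.xticks(rotation='vertical', size=8)
plt.show()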

pandas read_csv: ignore ending semicolon of last column

My data file looks like this:
data.txt
user,activity,timestamp,x-axis,y-axis,z-axis
0,33,Jogging,49105962326000,-0.6946376999999999,12.680544,0.50395286;
1,33,Jogging,49106062271000,5.012288,11.264028,0.95342433;
2,33,Jogging,49106112167000,4.903325,10.882658000000001,-0.08172209;
3,33,Jogging,49106222305000,-0.61291564,18.496431,3.0237172;
As can be seen, the last column ends with a semicolon, so when I read it into pandas, the column is inferred as type object (ending with the semicolon):
df = pd.read_csv('data.txt')
df
user activity timestamp x-axis y-axis z-axis
0 33 Jogging 49105962326000 -0.694638 12.680544 0.50395286;
1 33 Jogging 49106062271000 5.012288 11.264028 0.95342433;
2 33 Jogging 49106112167000 4.903325 10.882658 -0.08172209;
3 33 Jogging 49106222305000 -0.612916 18.496431 3.0237172;
How do I make pandas ignore that semicolon?
The problem with your txt is that it has mixed content: as can be seen, the header line doesn't have the semicolon as a termination character.
If you change the first line by adding the semicolon, it's quite simple:
pd.read_csv("data.txt", lineterminator=";")
That might not be practical in your case, but it works for the example given.
In the docs you could find comment param that:
indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in ‘a,b,c’ being treated as the header.
So if ; can only appear at the end of your last column:
>>> df = pd.read_csv("data.txt", comment=";")
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user 4 non-null int64
1 activity 4 non-null object
2 timestamp 4 non-null int64
3 x-axis 4 non-null float64
4 y-axis 4 non-null float64
5 z-axis 4 non-null float64
dtypes: float64(3), int64(2), object(1)
memory usage: 224.0+ bytes
>>> df
user activity timestamp x-axis y-axis z-axis
0 33 Jogging 49105962326000 -0.694638 12.680544 0.503953
1 33 Jogging 49106062271000 5.012288 11.264028 0.953424
2 33 Jogging 49106112167000 4.903325 10.882658 -0.081722
3 33 Jogging 49106222305000 -0.612916 18.496431 3.023717
You can make use of the converters param to parse your string, replace the ;, and convert to float:
df = pd.read_csv('data.txt', sep=",", converters={"z-axis": lambda x: float(x.replace(";",""))})
print(df)
user activity timestamp x-axis y-axis z-axis
0 0 33 Jogging 49105962326000 -0.694638 12.680544 0.503953
1 1 33 Jogging 49106062271000 5.012288 11.264028 0.953424
2 2 33 Jogging 49106112167000 4.903325 10.882658 -0.081722
3 3 33 Jogging 49106222305000 -0.612916 18.496431 3.023717
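Another option, if you prefer to clean up after reading, is to strip the trailing semicolon from the one affected column; a minimal sketch:
df = pd.read_csv('data.txt')
# z-axis came in as strings like '0.50395286;'; strip the ';' and cast
df['z-axis'] = df['z-axis'].str.rstrip(';').astype(float)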

Group by on a dataframe with the average of a column

I am really new to Python; I started learning it just a week ago. I have a query and hope you can help me solve it. Thanks in advance!
I have data in below format.
Date Product Price Discount
1/1/2020 A 17,490 30
1/1/2020 B 34,990 21
1/1/2020 C 20,734 11
1/2/2020 A 16,884 26
1/2/2020 B 26,990 40
1/2/2020 C 17,936 10
1/3/2020 A 16,670 36
1/3/2020 B 12,990 13
1/3/2020 C 30,990 43
I want to take the average of the Discount column for each date and have just 2 columns. It isn't working out.
Date AVG_Discount
1/1/2020 x %
1/2/2020 y %
1/3/2020 z %
What I have tried is below. As I said, I am a novice in Python, so the approach might be incorrect. I need guidance. TIA
mean_col=df.groupby(df['time'])['discount'].mean()
df=df.set_index(['time'])
df['mean_col']=mean_col
df=df.reset_index()
df.groupby(df['time'])['discount'].mean() is already returning a Series with time as the index.
All you need to do is call the reset_index function on it:
grouped_df = df.groupby(df['time'])['discount'].mean().reset_index()
As Quang Hoang suggested in the comments, you can also pass as_index=False to groupby.
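A sketch of that variant, keeping the question's lowercase column names:
grouped_df = df.groupby('time', as_index=False)['discount'].mean()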
Apparently, you have read your DataFrame from a text file,
e.g. CSV, but with a separator other than a comma.
Run df.info() and I assume you got a result something like below:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
Date 9 non-null object
Product 9 non-null object
Price 9 non-null object
Discount 9 non-null int64
dtypes: int64(1), object(3)
Note that the Date, Product and Price columns are of object type
(actually, strings). This remark is especially important in the case of the
Price column, because to compute a mean the source column must be
a number (not a string).
So first you should convert the Date and Price columns to proper types
(datetime and float). To do so, run:
df.Date = pd.to_datetime(df.Date)
df.Price = df.Price.str.replace(',', '').astype(float)  # ',' is a thousands separator here
Run df.info() again and now the result should be:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
Date 9 non-null datetime64[ns]
Product 9 non-null object
Price 9 non-null float64
Discount 9 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
And now you can compute the mean discount, running:
df.groupby('Date').Discount.mean()
For your data I got:
Date
2020-01-01 20.666667
2020-01-02 25.333333
2020-01-03 30.666667
Name: Discount, dtype: float64
Note that your code sample contains the following errors:
- The argument of groupby is a column name (or a list of column names), so df[...] between the parentheses is not needed.
- Instead of time you should write Date (you have no time column).
- Your Discount column is written starting with a capital D.
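Putting those fixes together, a short sketch that produces the two-column result the question asks for (the AVG_Discount name is illustrative):
avg_df = df.groupby('Date', as_index=False)['Discount'].mean()
avg_df = avg_df.rename(columns={'Discount': 'AVG_Discount'})
print(avg_df)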

Reading in data as float with a converter

I have a CSV file called 'filename' and want to read in the data as float64, except for the column 'hour'. I managed it with the pd.read_csv function and a converter.
df = pd.read_csv("../data/filename.csv",
delimiter = ';',
date_parser = ['hour'],
skiprows = 1,
converters={'column1': lambda x: float(x.replace ('.','').replace(',','.'))})
Now, I have two points:
FIRST:
The delimiter works with ';', but if I take a look at my data in Notepad, there are ',' characters, not ';'. But if I use ',' I get: 'pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 13, saw 9'.
SECOND:
If I want to use the converter for all columns, how can I do that? What's the right syntax?
I tried to use dtype=float in the read function, but I get 'AttributeError: 'NoneType' object has no attribute 'dtype''. What happened? That's the reason why I want to manage it with the converter.
Data:
,hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"
This should work:
In [40]:
import io
import pandas as pd
# text data
temp=''',hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"'''
# so read the csv, pass params quotechar and the thousands character
df = pd.read_csv(io.StringIO(temp), quotechar='"', thousands=',')
df
Out[40]:
Unnamed: 0 hour PV Wind onshore Wind offshore PV.1 Wind onshore.1 \
0 0 1 0 12985.0 9614.0 0 32825.5
1 1 2 0 12908.9 9290.8 0 36052.3
2 2 3 0 12740.9 8886.9 0 38540.9
3 3 4 0 12485.3 8644.5 0 40734.0
4 4 5 0 11188.5 8079.0 0 42688.0
5 5 6 0 11219.0 7594.2 0 43333.5
Wind offshore.1 PV.2 Wind onshore.2 Wind offshore.2
0 9495.7 0 13110.3 10855.5
1 9589.1 0 13670.2 10828.6
2 10087.3 0 14610.8 10828.6
3 10087.3 0 15638.3 10343.7
4 10087.3 0 16809.4 10343.7
5 10025.0 0 18266.9 10343.7
In [41]:
# check the dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 11 columns):
Unnamed: 0 6 non-null int64
hour 6 non-null int64
PV 6 non-null float64
Wind onshore 6 non-null float64
Wind offshore 6 non-null float64
PV.1 6 non-null float64
Wind onshore.1 6 non-null float64
Wind offshore.1 6 non-null float64
PV.2 6 non-null float64
Wind onshore.2 6 non-null float64
Wind offshore.2 6 non-null float64
dtypes: float64(9), int64(2)
memory usage: 576.0 bytes
So basically you need to pass the quotechar='"' and thousands=',' params to read_csv to achieve what you want, see the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
EDIT
If you want to convert after importing (which is wasteful when you can do it upfront), then you can do this for each column of interest:
In [43]:
# replace the comma separator
df['Wind onshore'] = df['Wind onshore'].str.replace(',','')
# convert the type
df['Wind onshore'] = df['Wind onshore'].astype(np.float64)
df['Wind onshore'].dtype
Out[43]:
dtype('float64')
It would be faster to replace the comma separator on all the columns of interest first and then call convert_objects, like so: df.convert_objects(convert_numeric=True). (convert_objects has since been removed from pandas; pd.to_numeric is the modern equivalent.)
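On the question's second point (one converter for every column), converters takes a dict keyed by column name, so you can build one entry per column after reading just the header; a minimal sketch, assuming the comma is a thousands separator as in the sample data and keeping the question's delimiter:
import pandas as pd
# guard against empty cells, which arrive as empty strings
to_float = lambda x: float(x.replace(',', '')) if x else float('nan')
# read zero rows just to learn the column names
cols = pd.read_csv('../data/filename.csv', delimiter=';', nrows=0).columns
df = pd.read_csv('../data/filename.csv', delimiter=';',
                 converters={c: to_float for c in cols if c != 'hour'})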

How to force the short summary DataFrame output in IPython Notebook

I'm using IPython Notebook and would like to be able to control which type of output is returned from submitting a simple DataFrame name. For example:
df = DataFrame({"A": [1,2,3], "B": [4,5,6]})
df
will always return a grid representation because it is such a small DataFrame. However, larger DataFrames will show in an HTML representation (at least with my settings in the notebook). That is, as long as there aren't more than about 15 columns; beyond that, the representation is something like this:
<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, 6215.0 to 6028.0
Data columns:
District Name 35 non-null values
Total Percent Pass 35 non-null values
GE Percent Pass 35 non-null values
SE Percent Pass 3 non-null values
White Percent Pass 11 non-null values
Black Percent Pass 23 non-null values
Hispanic Percent Pass 21 non-null values
Asian Percent Pass 4 non-null values
PI Percent Pass 0 non-null values
AI Percent Pass 0 non-null values
Other Percent Pass 0 non-null values
EcDis Percent Pass 29 non-null values
Non EcDis Percent Pass 26 non-null values
Total Percent Pass_D 35 non-null values
GE Percent Pass_D 35 non-null values
SE Percent Pass_D 31 non-null values
White Percent Pass_D 27 non-null values
Black Percent Pass_D 35 non-null values
Hispanic Percent Pass_D 35 non-null values
Asian Percent Pass_D 19 non-null values
PI Percent Pass_D 0 non-null values
AI Percent Pass_D 0 non-null values
Other Percent Pass_D 0 non-null values
EcDis Percent Pass_D 35 non-null values
Non EcDis Percent Pass_D 35 non-null values
Comparative 35 non-null values
dtypes: float64(25), object(1)
I would like to be able to force this type of representation (the last one) at will, without having to turn options on and off in the notebook. Is there a way to do this?
Extending your example, you can just use df.info():
In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns:
A 3 non-null values
B 3 non-null values
dtypes: int64(2)
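If you would rather flip the representation globally than call df.info() each time, pandas display options can force the summary view; a sketch, assuming a pandas version where these options exist:
import pandas as pd
# for frames exceeding max_rows/max_columns, show the info summary
# instead of a truncated grid
pd.set_option('display.large_repr', 'info')
pd.set_option('display.max_columns', 15)  # illustrative threshold
This still means setting options, which the question wanted to avoid, so df.info() remains the lighter-weight answer.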
