I lose my values in the columns - python

I've organized my data using pandas, and my procedure looks like this:
import pandas as pd
import numpy as np
df1 = pd.read_table(r'E:\빅데이터 캠퍼스\골목상권 프로파일링 - 서울 열린데이터 광장 3.초기-16년5월분1\17.상권-추정매출\201301-201605\tbsm_trdar_selng.txt\tbsm_trdar_selng_utf8.txt',
                    sep='|', header=None,
                    dtype={'0': pd.np.int})
df1 = df1.replace('201301', int(201301))
df2 = df1[[0, 1, 2, 3, 4, 11, 12, 82]]
df2.columns = ['STDR_YM_CD', 'TRDAR_CD', 'TRDAR_CD_NM', 'SVC_INDUTY_CD', 'SVC_INDUTY_CD_NM', 'THSMON_SELNG_AMT', 'THSMON_SELNG_CO', 'STOR_CO']
print(df2.head(40))
df3_groupby = df2.groupby(['STDR_YM_CD', 'TRDAR_CD'])
df4_agg = df3_groupby.agg(np.sum)
print(df4_agg.head(30))
When I print df2, I can see both the 11947 and 11948 values in my TRDAR_CD column, as in the picture below.
After that, I used the groupby function and I lose my 11948 values in the TRDAR_CD column. You can see this situation in the picture below.
Could this problem come from the warning message? The warning message is: 'sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.'
Please help.
The output of print(df2.info()) is:
RangeIndex: 1089023 entries, 0 to 1089022
Data columns (total 8 columns):
STDR_YM_CD 1089023 non-null object
TRDAR_CD 1089023 non-null int64
TRDAR_CD_NM 1085428 non-null object
SVC_INDUTY_CD 1089023 non-null object
SVC_INDUTY_CD_NM 1089023 non-null object
THSMON_SELNG_AMT 1089023 non-null int64
THSMON_SELNG_CO 1089023 non-null int64
STOR_CO 1089023 non-null int64
dtypes: int64(4), object(4)
memory usage: 66.5+ MB
None

A MultiIndex is created from the first and second columns, and if the first level contains duplicates, pandas by default 'sparsifies' the higher levels of the index to make the console output a bit easier on the eyes.
You can show the data in the first level of the MultiIndex by setting display.multi_sparse to False.
Sample:
df = pd.DataFrame({'A': [1, 1, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
df.set_index(['A', 'B'], inplace=True)
print(df)
     C
A B   
1 4  7
  5  8
3 6  9
# temporarily set multi_sparse to False
# http://pandas.pydata.org/pandas-docs/stable/options.html#getting-and-setting-options
with pd.option_context('display.multi_sparse', False):
    print(df)
     C
A B   
1 4  7
1 5  8
3 6  9
EDIT (after the question was edited):
I think the problem is that the value 11948 is a string, so it is omitted.
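To see why the type matters, here is a tiny illustration (hypothetical values, not the asker's file): an int and a string that print identically are still different keys to pandas.
print(11948 == '11948')  # False: the int and the string never compare equal,
                         # so they end up as two distinct group keys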
EDIT1 (after the file was provided):
You can simplify your solution by adding the usecols parameter in read_table and then aggregating with GroupBy.sum:
import pandas as pd
import numpy as np
df2 = pd.read_table(r'tbsm_trdar_selng_utf8.txt',
                    sep='|',
                    header=None,
                    usecols=[0, 1, 2, 3, 4, 11, 12, 82],
                    names=['STDR_YM_CD', 'TRDAR_CD', 'TRDAR_CD_NM', 'SVC_INDUTY_CD', 'SVC_INDUTY_CD_NM', 'THSMON_SELNG_AMT', 'THSMON_SELNG_CO', 'STOR_CO'],
                    dtype={0: int})  # header=None gives integer column labels, so the key is 0, not '0'
df4_agg = df2.groupby(['STDR_YM_CD', 'TRDAR_CD']).sum()
print(df4_agg.head(10))
                     THSMON_SELNG_AMT  THSMON_SELNG_CO  STOR_CO
STDR_YM_CD TRDAR_CD                                            
201301     11947           1966588856            74798       73
           11948           3404215104            89064      116
           11949           1078973946            42005       45
           11950           1759827974            93245       71
           11953            779024380            21042       84
           11954           2367130386            94033      128
           11956            511840921            23340       33
           11957            329738651            15531       50
           11958           1255880439            42774      118
           11962           1837895919            66692       68

Related

Pandas Data Frame Graphing Issue

I am curious as to why, when I create a data frame in the manner below (using lists to build the row values), it does not graph and instead gives me the error "ValueError: x must be a label or position":
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
values = [9.83, 19.72, 7.19, 3.04]
values
[9.83, 19.72, 7.19, 3.04]
cols = ['Condition', 'No-Show']
conditions = ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism']
df = pd.DataFrame(columns=[cols])
df['Condition'] = conditions
df['No-Show'] = values
df
      Condition  No-Show
0   Scholarship     9.83
1  Hipertension    19.72
2      Diabetes     7.19
3    Alcoholism     3.04
df.plot(kind='bar', x='Condition', y='No-Show');
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 df.plot(kind='bar', x='Condition', y='No-Show')
File ~\anaconda3\lib\site-packages\pandas\plotting\_core.py:938, in
PlotAccessor.__call__(self, *args, **kwargs)
936 x = data_cols[x]
937 elif not isinstance(data[x], ABCSeries):
--> 938 raise ValueError("x must be a label or position")
939 data = data.set_index(x)
940 if y is not None:
941 # check if we have y as int or list of ints
ValueError: x must be a label or position
Yet if I create the same DataFrame a different way, it graphs just fine....
df2 = pd.DataFrame({'Condition': ['Scholarship', 'Hipertension', 'Diatebes', 'Alcoholism'],
                    'No-Show': [9.83, 19.72, 7.19, 3.04]})
df2
      Condition  No-Show
0   Scholarship     9.83
1  Hipertension    19.72
2      Diatebes     7.19
3    Alcoholism     3.04
df2.plot(kind='bar', x='Condition', y='No-Show')
plt.ylim(0, 50)
#graph appears here just fine
Can someone enlighten me why it works the second way and not the first? I am a new student and am confused. I appreciate any insight.
Let's look at pd.DataFrame.info for both dataframes.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   (Condition,)  4 non-null      object
 1   (No-Show,)    4 non-null      float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note, your column headers are one-element tuples like (Condition,), not plain strings.
Now, look at info for df2.
df2.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Condition  4 non-null      object
 1   No-Show    4 non-null      float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes
Note, your column headers here are plain strings.
As @BigBen states in his comment, you don't need the extra brackets in your dataframe constructor for df.
FYI, to make the plot work with the incorrect dataframe constructor for df as-is:
df.plot(kind='bar', x=('Condition',), y=('No-Show',))
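Alternatively (a sketch, not from the original answer), you can flatten the tuple headers back to plain strings and then plot as usual:
df.columns = df.columns.get_level_values(0)  # ('Condition',) -> 'Condition'
df.plot(kind='bar', x='Condition', y='No-Show')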

Pandas replace method and object datatypes

I am using df = df.replace('No data', np.nan) on a csv file containing 'No data' instead of blank/null entries where there is no numeric data. Using the head() method I see that the replace method does replace the 'No data' entries with NaN. But when I use df.info() it says that the datatype of each of the series is object.
When I open the csv file in Excel and manually edit the data, using find and replace to change 'No data' to blank/null entries, the dataframes look exactly the same with df.head(), yet df.info() says that the datatypes are floats.
I was wondering why this is, and how I can make the datatypes of my series floats without having to manually edit the csv files.
If the rest of the data in your columns is numeric then you should use pd.to_numeric, applied column by column, e.g. df.apply(pd.to_numeric, errors='coerce').
import pandas as pd
import numpy as np
# create data for columns with a string mixed in
column_data = [1, 2, 3] + ['no data']
# construct a data frame with two columns
df = pd.DataFrame({'col1': column_data, 'col2': column_data[::-1]})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
col1 4 non-null object
col2 4 non-null object
dtypes: object(2)
memory usage: 144.0+ bytes
# replace 'no data' with NaN
df_nan = df.replace('no data', np.nan)
# set the type of all columns to float (the method is astype, not as_type)
df_result = df_nan.astype({c: float for c in df_nan.columns})
df_result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
col1 3 non-null float64
col2 3 non-null float64
dtypes: float64(2)
memory usage: 144.0 bytes
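For comparison, the pd.to_numeric route mentioned at the top, as a minimal sketch; errors='coerce' turns anything unparseable (such as 'no data') straight into NaN, with no separate replace step:
df_result = df.apply(pd.to_numeric, errors='coerce')
df_result.info()  # col1 and col2 both come out as float64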
Consider using the converters arg in pandas.read_csv(), where you pass a dictionary of column numbers referencing a conversion function. The function below checks for the 'No data' string and conditionally replaces the value with np.nan, otherwise leaving it as is:
import numpy as np
import pandas as pd
c_fct = lambda x: float(x) if 'No data' not in x else np.nan
convertdict = {1: c_fct, 2: c_fct, 3: c_fct, 4: c_fct, 5: c_fct}
df = pd.read_csv('Input.csv', converters=convertdict)
Input CSV
ID Col1 Col2 Col3 Col4 Col5
TGG 0.634516647 0.900464347 0.998505978 0.170422713 0.893340128
GRI No data 0.045915333 0.718398939 0.924813864 No data
NLB 0.921127268 0.614460813 0.677857676 0.343612947 0.559437744
SEI 0.081852313 No data 0.890816385 0.943313021 0.874857844
LOY 0.632556715 0.362855866 0.038702448 0.253762859 No data
OPL 0.375088582 0.268283238 0.761552111 0.589547625 0.192223208
CTK 0.349464541 0.844718987 No data 0.841439909 0.898093646
EUE 0.629784261 0.982589843 0.315670377 0.832419474 0.950044814
JLP 0.543942659 0.988380305 0.417191823 0.823857176 0.542514099
RHK 0.728053447 0.521816539 0.402523435 No data 0.558226706
AEM 0.005495116 0.715363776 0.075508356 0.959119268 0.844730368
VLQ 0.21146319 0.558208766 0.501769554 0.226539046 0.795861461
MDB 0.230514689 0.223163664 No data 0.324636384 0.700716246
LPH 0.853433224 0.582678173 0.633109347 0.432191426 No data
PEP 0.41096305 No data .627776178 0.482359278 0.179863537
UQK 0.252598809 0.497517585 0.276060768 No data 0.087985623
KGJ 0.033985585 0.033702088 anNo data 0.286682709 0.543349787
JUQ 0.25971543 0.142067155 0.597985191 0.219841249 0.699822866
NYW No data 0.17187907 0.157413049 0.209011772 0.592824483
Output
print(df)
# ID Col1 Col2 Col3 Col4 Col5
# 0 TGG 0.634517 0.900464 0.998506 0.170423 0.893340
# 1 GRI NaN 0.045915 0.718399 0.924814 NaN
# 2 NLB 0.921127 0.614461 0.677858 0.343613 0.559438
# 3 SEI 0.081852 NaN 0.890816 0.943313 0.874858
# 4 LOY 0.632557 0.362856 0.038702 0.253763 NaN
# 5 OPL 0.375089 0.268283 0.761552 0.589548 0.192223
# 6 CTK 0.349465 0.844719 NaN 0.841440 0.898094
# 7 EUE 0.629784 0.982590 0.315670 0.832419 0.950045
# 8 JLP 0.543943 0.988380 0.417192 0.823857 0.542514
# 9 RHK 0.728053 0.521817 0.402523 NaN 0.558227
# 10 AEM 0.005495 0.715364 0.075508 0.959119 0.844730
# 11 VLQ 0.211463 0.558209 0.501770 0.226539 0.795861
# 12 MDB 0.230515 0.223164 NaN 0.324636 0.700716
# 13 LPH 0.853433 0.582678 0.633109 0.432191 NaN
# 14 PEP 0.410963 NaN 0.627776 0.482359 0.179864
# 15 UQK 0.252599 0.497518 0.276061 NaN 0.087986
# 16 KGJ 0.033986 0.033702 NaN 0.286683 0.543350
# 17 JUQ 0.259715 0.142067 0.597985 0.219841 0.699823
# 18 NYW NaN 0.171879 0.157413 0.209012 0.592824
print(df.dtypes)
# ID object
# Col1 float64
# Col2 float64
# Col3 float64
# Col4 float64
# Col5 float64
# dtype: object

Sorting a Pandas dataframe

I have the following dataframe:
Join_Count   1
LSOA11CD      
E01006512   15
E01006513   35
E01006514   11
E01006515   11
E01006518   11
...
But when I try to sort it:
BusStopList.sort("LSOA11CD", ascending=1)
I get the following:
KeyError: 'LSOA11CD'
How do I go about sorting this by either the LSOA column or the column full of numbers which doesn't have a heading?
The following is the information produced by Python about this dataframe:
<class 'pandas.core.frame.DataFrame'>
Index: 286 entries, E01006512 to E01033768
Data columns (total 1 columns):
1 286 non-null int64
dtypes: int64(1)
memory usage: 4.5+ KB
'LSOA11CD' is the name of the index, and 1 is the name of the column. So you must use sort_index (rather than sort_values):
BusStopList.sort_index(level="LSOA11CD", ascending=True)
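If you instead want to sort by the numeric column, note from the info output that its name is the integer 1, not the string '1'; a minimal sketch:
BusStopList.sort_values(by=1, ascending=False)  # sort by the unnamed count column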

Reading in data as float per converter

I have a csv file called 'filename' and want to read in the data as float64, except for the column 'hour'. I managed it with the pd.read_csv function and a converter.
df = pd.read_csv("../data/filename.csv",
                 delimiter=';',
                 date_parser=['hour'],
                 skiprows=1,
                 converters={'column1': lambda x: float(x.replace('.', '').replace(',', '.'))})
Now, I have two points:
FIRST: The delimiter works with ';', but when I look at my data in Notepad there are ',', not ';'. Yet if I use ',' I get: 'pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 13, saw 9'.
SECOND: If I want to use the converter for all columns, how can I do that? What's the right term?
I tried to use dtype=float in the read function, but I get 'AttributeError: 'NoneType' object has no attribute 'dtype''. What happened? That's the reason why I wanted to manage it with the converter.
Data:
,hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"
This should work:
In [40]:
import io
import pandas as pd
# text data
temp = ''',hour,PV,Wind onshore,Wind offshore,PV.1,Wind onshore.1,Wind offshore.1,PV.2,Wind onshore.2,Wind offshore.2
0,1,0.0,"12,985.0","9,614.0",0.0,"32,825.5","9,495.7",0.0,"13,110.3","10,855.5"
1,2,0.0,"12,908.9","9,290.8",0.0,"36,052.3","9,589.1",0.0,"13,670.2","10,828.6"
2,3,0.0,"12,740.9","8,886.9",0.0,"38,540.9","10,087.3",0.0,"14,610.8","10,828.6"
3,4,0.0,"12,485.3","8,644.5",0.0,"40,734.0","10,087.3",0.0,"15,638.3","10,343.7"
4,5,0.0,"11,188.5","8,079.0",0.0,"42,688.0","10,087.3",0.0,"16,809.4","10,343.7"
5,6,0.0,"11,219.0","7,594.2",0.0,"43,333.5","10,025.0",0.0,"18,266.9","10,343.7"'''
# so read the csv, pass params quotechar and the thousands character
df = pd.read_csv(io.StringIO(temp), quotechar='"', thousands=',')
df
Out[40]:
   Unnamed: 0  hour  PV  Wind onshore  Wind offshore  PV.1  Wind onshore.1  \
0           0     1   0       12985.0         9614.0     0         32825.5
1           1     2   0       12908.9         9290.8     0         36052.3
2           2     3   0       12740.9         8886.9     0         38540.9
3           3     4   0       12485.3         8644.5     0         40734.0
4           4     5   0       11188.5         8079.0     0         42688.0
5           5     6   0       11219.0         7594.2     0         43333.5

   Wind offshore.1  PV.2  Wind onshore.2  Wind offshore.2
0           9495.7     0         13110.3          10855.5
1           9589.1     0         13670.2          10828.6
2          10087.3     0         14610.8          10828.6
3          10087.3     0         15638.3          10343.7
4          10087.3     0         16809.4          10343.7
5          10025.0     0         18266.9          10343.7
In [41]:
# check the dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 11 columns):
Unnamed: 0 6 non-null int64
hour 6 non-null int64
PV 6 non-null float64
Wind onshore 6 non-null float64
Wind offshore 6 non-null float64
PV.1 6 non-null float64
Wind onshore.1 6 non-null float64
Wind offshore.1 6 non-null float64
PV.2 6 non-null float64
Wind onshore.2 6 non-null float64
Wind offshore.2 6 non-null float64
dtypes: float64(9), int64(2)
memory usage: 576.0 bytes
So basically you need to pass the quotechar='"' and thousands=',' params to read_csv to achieve what you want, see the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv
EDIT
If you want to convert after importing (which is a waste when you can do it upfront) then you can do this for each column of interest:
In [43]:
import numpy as np
# replace the comma separator
df['Wind onshore'] = df['Wind onshore'].str.replace(',', '')
# convert the type
df['Wind onshore'] = df['Wind onshore'].astype(np.float64)
df['Wind onshore'].dtype
Out[43]:
dtype('float64')
It would be faster to replace the comma separator on all the columns of interest first and just call convert_objects like so: df.convert_objects(convert_numeric=True)
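(Note: convert_objects is deprecated in newer pandas. A rough modern equivalent, as a sketch that assumes every remaining object column should become numeric:)
for col in df.select_dtypes(include=['object']).columns:
    df[col] = pd.to_numeric(df[col].str.replace(',', ''), errors='coerce')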

pandas update a dataframe column based on prefiltered groupby object

Given dataframe d such as this:
index  col1
1      a
2      a
3      b
4      b
I create a prefiltered group object with new values:
g = d[prefilter].groupby(['some cols']).apply(somefunc)
index  col1
2      c
4      d
Now I want to update d to this:
index  col1
1      a
2      c
3      b
4      d
I've been hacking away with update, ix, filtering, where, etc. I am guessing there is an obvious solution I am not seeing here.
Stuff like this is not working:
d[d.index == db.index]['alert_v'] = db['alert_v']
q90 = g.transform( somefunc )
d.ix[ d['alert_v'] >=q90, 'alert_v'] = 1
d.ix[ d['alert_v'] < q90, 'alert_v'] = 0
d['alert_v'] = np.where( d.index==db.index, db['alert_v'], d['alert_v'] )
Any help is appreciated, thank you.
--edit--
The two dataframes have the same form: one is simply a filtered version of the other, with different values, and I want to update those values into the original.
ValueError: cannot reindex from a duplicate axis
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 1984-12-12 13:33:00 to 1939-03-19 22:54:00
Data columns (total 9 columns):
source 2186 non-null object
subject_id 2186 non-null float64
alert_t 2186 non-null object
variable 2186 non-null object
timeindex 2186 non-null datetime64[ns]
alert_v 2105 non-null float64
value 2186 non-null float64
tavg 54 non-null timedelta64[ns]
iqt 61 non-null object
dtypes: datetime64[ns](1), float64(3), object(4), timedelta64[ns](1)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1982 entries, 1984-12-12 13:33:00 to 1939-03-19 22:54:00
Data columns (total 9 columns):
source 1982 non-null object
subject_id 1982 non-null float64
alert_t 1982 non-null object
variable 1982 non-null object
timeindex 1982 non-null datetime64[ns]
alert_v 1982 non-null int64
value 1982 non-null float64
tavg 0 non-null timedelta64[ns]
iqt 0 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4), timedelta64[ns](1)
You want the df.update() function. Try something like this:
import pandas as pd
df1 = pd.DataFrame({'Index': [1, 2, 3, 4], 'Col1': ['A', 'B', 'C', 'D']}).set_index('Index')
df2 = pd.DataFrame({'Index': [2, 4], 'Col1': ['E', 'F']}).set_index('Index')
print(df1)
      Col1
Index     
1        A
2        B
3        C
4        D
df1.update(df2)
print(df1)
      Col1
Index     
1        A
2        E
3        C
4        F
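A note on the ValueError in the question (my reading, not part of the original answer): update() aligns rows on the index, and 'cannot reindex from a duplicate axis' indicates the DatetimeIndex contains duplicate timestamps. A quick check and one common workaround:
print(d.index.is_unique)                    # False would explain the error
d = d[~d.index.duplicated(keep='first')]    # drop rows with duplicate index labels first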
