pandas dataframe plot columns - python

After reading a series of files I create a dataframe with 7 columns:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 756 entries, 0 to 755
Data columns:
Fr(Hz) 756 non-null values
res_ohm*m 756 non-null values
phase_mrad 756 non-null values
ImC_S/m 756 non-null values
Rm_S/m 756 non-null values
C_el 756 non-null values
date 756 non-null values
dtypes: float64(6), object(1)
then I want to group the data by column 6 (C_el), which has 12 unique values:
Pairs = data_set.groupby('C_el')
Each group now contains a multiple of 21 rows (every 21 lines form one unique dataset); the 21 comes from column 1, Fr(Hz), since I use 21 frequencies per dataset.
What I want to do is create an x, y scatter plot with column 1 (Fr(Hz)) on the x axis and column 3 (phase_mrad) on the y axis. Each dataset contributes its 21 unique frequency points, and I want to add all available datasets to the same plot, each in a different color.
The final step is to repeat this for the 11 remaining groups (as defined in the earlier step).
sample datasets are here (A12)
currently I do this very clumsily in numpy (multiple_datasets)

I don't know if this will fully satisfy your requirement, but I think groupby can do you a big favour here. For instance, instead of the code example you provided, you could do this:
for key, group in data_set.groupby('C_el'):
    # -- define the filename, path, etc.
    # e.g. filename = key
    group.to_csv(filename, sep=' ')
See also the documentation here. Sorry I can't help you out with more details, but I hope this helps you proceed.
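If the goal is the scatter plots themselves rather than per-group files, here is a rough sketch along the same lines (assuming the column names from the question, and that each group's rows arrive in consecutive blocks of 21; matplotlib cycles a different colour for each series automatically):

import numpy as np
import matplotlib.pyplot as plt

for key, group in data_set.groupby('C_el'):
    fig, ax = plt.subplots()
    # every consecutive block of 21 rows is one dataset within this group
    for i, (_, chunk) in enumerate(group.groupby(np.arange(len(group)) // 21)):
        ax.scatter(chunk['Fr(Hz)'], chunk['phase_mrad'], label='dataset %d' % i)
    ax.set_xscale('log')  # frequency sweeps usually span decades; drop if not wanted
    ax.set_xlabel('Fr (Hz)')
    ax.set_ylabel('phase (mrad)')
    ax.set_title('C_el = %s' % key)
    ax.legend()
plt.show()

This produces one figure per C_el group, with one coloured 21-point series per dataset, and repeats automatically for all 12 groups.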

Related

adding new column to a pandas dataframe based on another dataframe column with different rows number

I'm new to Python applied to data science and I'm a bit stuck on a (simple) problem...
I have 2 dataframes: data (stores the latest Madrid election results) and map_municipios (stores data on Madrid's municipalities as GeoJSON).
I would like to draw a map of the election results. I could achieve it, but since I couldn't get the proper geographical info from map_municipios into the data dataframe, the municipality names and info are wrong (e.g. the Madrid municipality shows another municipality's name and results).
My two data frames have the following info:
data.info()
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   cpro       179 non-null    int64
 1   cmun       179 non-null    int64
 2   municipio  179 non-null    object
......
map_municipios.info()
 16  cpro         164 non-null  object
 17  cmun         164 non-null  object
 18  dc           164 non-null  object
 19  codigo_post  164 non-null  object
 20  geometry     182 non-null  geometry
All my project info is on these links:
MadridElections Full Repo
Well, I'm trying to match the info between both dataframes using the following columns: 'cpro' and 'cmun'. If cpro and cmun are equal in both dataframes, the 'geometry' value has to be added to a new 'geometry' column in the data dataframe.
Searching info I tried the following:
data['geometry'] = np.where(
(data['cpro'].equals(map_municipios['cpro'])) &
(data['cmun'].equals(map_municipios['cmun'])),
map_municipios['geometry'], 0
)
It returns an error because len(data) != len(map_municipios) (179 != 182). len(data) is the correct number of municipalities in the Madrid region.
I tried wrapping things in pd.Series before np.where(...), but that creates the data['geometry'] column filled with zeroes instead of coordinate values.
QUESTION: is there any easy way to take my map_municipios['geometry'] column and hand it to the data dataframe in the correct order (despite the differing lengths; extra values would be dismissed)?
Any hint, link, etc will be really appreciated.
Thank you in advance.
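Since the two frames have different lengths, positional comparison (.equals inside np.where) can't line the rows up. A sketch of what I believe is the usual approach, a merge by value on the two key columns (assuming the object-typed cpro/cmun in map_municipios hold plain digit strings that can be cast to int):

import pandas as pd

keys = ['cpro', 'cmun']
# bring the key columns to a common dtype before merging
map_keys = map_municipios[keys + ['geometry']].copy()
map_keys[keys] = map_keys[keys].astype(int)

# a left merge keeps the 179 rows of `data` and simply drops the
# extra map rows that have no counterpart
data = data.merge(map_keys, on=keys, how='left')

Rows of map_municipios without a match are dismissed, and any row of data that finds no geometry gets NaN, which is easy to inspect afterwards.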

How to plot a large dataframe

This is what my dataframe looks like :
Date,Sales, location
There are a total of 20k entries. Dates run from 2016 to 2019. I need dates on the x axis and sales on the y axis. This is how I have done it:
df1.plot(x="DATE", y=["Total_Sales"], kind="bar", figsize=(1000,20))
Unfortunately, even with this the dates aren't clearly visible. How do I make sure they are plotted legibly? Is there a way to create bins or something?
Edit: output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 382 entries, 0 to 18116
Data columns (total 5 columns):
DATE 382 non-null object
Total_Sales 358 non-null float64
Total_Sum 24 non-null float64
Total_Units 382 non-null int64
locationkey 382 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 17.9+ KB
Edit: Maybe I can divide it into years stacked on top of each other. So, for instance, Jan to Dec 2016 would be first, and then succeeding years would be plotted with it. How do I do that?
I recommend that you do this:
df.DATE = pd.to_datetime(df.DATE)
df = df.set_index('DATE')
Now the dataframe's index is the date. This is very convenient. For example, you can do:
df.resample('Y').sum()
You should also be able to plot:
df.Total_Sales.plot()
And pandas will take care of making the x-axis linear in time, formatting the date, etc.
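For the edit about stacking years on top of each other, one possible sketch (assuming the DATE and Total_Sales columns above): resample to monthly totals, then plot each year as its own line over a common Jan-to-Dec axis:

import pandas as pd
import matplotlib.pyplot as plt

df.DATE = pd.to_datetime(df.DATE)
df = df.set_index('DATE')

# monthly totals; then one line per year over the same 1..12 month axis
monthly = df.Total_Sales.resample('M').sum()
fig, ax = plt.subplots()
for year, chunk in monthly.groupby(monthly.index.year):
    ax.plot(chunk.index.month, chunk.values, label=year)
ax.set_xlabel('month')
ax.set_ylabel('Total_Sales')
ax.legend()
plt.show()

Resampling also acts as the binning you asked about: with 20k daily entries collapsed to monthly (or weekly) totals, the x-axis labels become legible.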

Sorting a Pandas dataframe

I have the following dataframe:
Join_Count 1
LSOA11CD
E01006512 15
E01006513 35
E01006514 11
E01006515 11
E01006518 11
...
But when I try to sort it:
BusStopList.sort("LSOA11CD",ascending=1)
I get the following:
Key Error: 'LSOA11CD'
How do I go about sorting this by either the LSOA column or the column full of numbers which doesn't have a heading?
The following is the information produced by Python about this dataframe:
<class 'pandas.core.frame.DataFrame'>
Index: 286 entries, E01006512 to E01033768
Data columns (total 1 columns):
1 286 non-null int64
dtypes: int64(1)
memory usage: 4.5+ KB
'LSOA11CD' is the name of the index, and 1 is the name of the column. So to sort by the codes you must use sort_index (rather than sort_values):
BusStopList.sort_index(level="LSOA11CD", ascending=True)
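And to sort by the numbers instead of the codes, the data column can be addressed by its name, which here is the integer 1, not a string:

# sort by the index (the LSOA11CD codes)
BusStopList.sort_index(ascending=True)

# sort by the single data column, whose name is the integer 1
BusStopList.sort_values(by=1, ascending=True)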

Plotting with GroupBy in Pandas/Python

Although it is straightforward and easy to plot groupby objects in pandas, I am wondering about the most pythonic (pandastic?) way to grab the unique groups from a groupby object. For example:
I am working with atmospheric data and trying to plot diurnal trends over a period of several days or more. The following is the DataFrame containing many days worth of data where the timestamp is the index:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10909 entries, 2013-08-04 12:01:00 to 2013-08-13 17:43:00
Data columns (total 17 columns):
Date 10909 non-null values
Flags 10909 non-null values
Time 10909 non-null values
convt 10909 non-null values
hino 10909 non-null values
hinox 10909 non-null values
intt 10909 non-null values
no 10909 non-null values
nox 10909 non-null values
ozonf 10909 non-null values
pmtt 10909 non-null values
pmtv 10909 non-null values
pres 10909 non-null values
rctt 10909 non-null values
smplf 10909 non-null values
stamp 10909 non-null values
no2 10909 non-null values
dtypes: datetime64[ns](1), float64(11), int64(2), object(3)
To be able to average the data (and take other statistics) at every minute across several days, I group the dataframe:
data = no.groupby('Time')
I can then easily plot the mean NO concentration as well as quartiles:
ax = figure(figsize=(12,8)).add_subplot(111)
title('Diurnal Profile for NO, NO2, and NOx: East St. Louis Air Quality Study')
ylabel('Concentration [ppb]')
data.no.mean().plot(ax=ax, style='b', label='Mean')
data.no.apply(lambda x: percentile(x, 25)).plot(ax=ax, style='r', label='25%')
data.no.apply(lambda x: percentile(x, 75)).plot(ax=ax, style='r', label='75%')
The issue that fuels my question is that in order to plot more interesting-looking things, e.g. with fill_between(), it is necessary to know the x-axis information, per the documentation:
fill_between(x, y1, y2=0, where=None, interpolate=False, hold=None, **kwargs)
For the life of me, I cannot figure out the best way to accomplish this. I have tried:
Iterating over the groupby object and creating an array of the groups
Grabbing all of the unique Time entries from the original DataFrame
I can make these work, but I know there is a better way. Python is far too beautiful. Any ideas/hints?
UPDATES:
The statistics can be dumped into a new dataframe using unstack() such as
no_new = no.groupby('Time')['no'].describe().unstack()
no_new.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1440 entries, 00:00 to 23:59
Data columns (total 8 columns):
count 1440 non-null values
mean 1440 non-null values
std 1440 non-null values
min 1440 non-null values
25% 1440 non-null values
50% 1440 non-null values
75% 1440 non-null values
max 1440 non-null values
dtypes: float64(8)
Although I should be able to plot with fill_between() using no_new.index, I receive a TypeError.
Current plot code and TypeError:
ax = figure(figsize=(12,8)).add_subplot(111)
ax.plot(no_new['mean'])
ax.fill_between(no_new.index, no_new['mean'], no_new['75%'], alpha=.5, facecolor='green')
TypeError:
TypeError Traceback (most recent call last)
<ipython-input-6-47493de920f1> in <module>()
2 ax = figure(figsize=(12,8)).add_subplot(111)
3 ax.plot(no_new['mean'])
----> 4 ax.fill_between(no_new.index, no_new['mean'], no_new['75%'], alpha=.5, facecolor='green')
5 #title('Diurnal Profile for NO, NO2, and NOx: East St. Louis Air Quality Study')
6 #ylabel('Concentration [ppb]')
C:\Users\David\AppData\Local\Enthought\Canopy\User\lib\site-packages\matplotlib\axes.pyc in fill_between(self, x, y1, y2, where, interpolate, **kwargs)
6986
6987 # Convert the arrays so we can work with them
-> 6988 x = ma.masked_invalid(self.convert_xunits(x))
6989 y1 = ma.masked_invalid(self.convert_yunits(y1))
6990 y2 = ma.masked_invalid(self.convert_yunits(y2))
C:\Users\David\AppData\Local\Enthought\Canopy\User\lib\site-packages\numpy\ma\core.pyc in masked_invalid(a, copy)
2237 cls = type(a)
2238 else:
-> 2239 condition = ~(np.isfinite(a))
2240 cls = MaskedArray
2241 result = a.view(cls)
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The plot as of now looks like this: [plot image omitted]
Storing the groupby stats (mean/25/75) as columns in a new dataframe and then passing the new dataframe's index as the x parameter of plt.fill_between() works for me (tested with matplotlib 1.3.1). e.g.,
gdf = df.groupby('Time')[col].describe().unstack()
plt.fill_between(gdf.index, gdf['25%'], gdf['75%'], alpha=.5)
gdf.info() should look like this:
<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, 00:00:00 to 22:00:00
Data columns (total 8 columns):
count 12 non-null float64
mean 12 non-null float64
std 12 non-null float64
min 12 non-null float64
25% 12 non-null float64
50% 12 non-null float64
75% 12 non-null float64
max 12 non-null float64
dtypes: float64(8)
Update: to address the TypeError: ufunc 'isfinite' not supported exception, it is necessary to first convert the Time column from a series of string objects in "HH:MM" format to a series of datetime.time objects, which can be done as follows:
df['Time'] = df.Time.map(lambda x: pd.datetools.parse(x).time())
(Note that pd.datetools has since been removed from pandas; the modern equivalent is pd.to_datetime(x).time().)
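Putting the whole pipeline together, a minimal sketch (assuming a 'Time' column of 'HH:MM' strings and a numeric 'no' column; this variant parses the times to full timestamps, which matplotlib's date machinery handles natively, instead of using the removed pd.datetools):

import pandas as pd
import matplotlib.pyplot as plt

# parse 'HH:MM' strings to timestamps (on a dummy date) so matplotlib
# can place them on the x-axis
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M')

# per-minute statistics; recent pandas returns the describe() table
# directly, while older versions needed the trailing .unstack()
gdf = df.groupby('Time')['no'].describe()

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(gdf.index, gdf['mean'], 'b', label='Mean')
ax.fill_between(gdf.index, gdf['25%'], gdf['75%'], alpha=.5, facecolor='green')
ax.set_ylabel('Concentration [ppb]')
ax.legend()
plt.show()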

Is there a way to group by logical comparison of two columns in Pandas?

I have a dataframe with the following structure:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1152 entries, 0 to 143
Data columns:
cuepos 1152 non-null values
response 1152 non-null values
soa 1152 non-null values
targetpos 1152 non-null values
testorientation 1152 non-null values
dtypes: float64(3), int64(2)
The cuepos column and the targetpos column both contain integer values of either 1 or 2.
I would like to group this data by congruency between cuepos and targetpos. In other words, I would like to produce two groups, one for rows in which cuepos == targetpos and another group for which cuepos != targetpos.
I can't seem to figure out how I might do this. I looked at using grouping functions, but these seem only to act on a single column... or am I mistaken? Can someone point me in the right direction?
Thanks in advance!
Blz
Note, if your goal is to do group computations, you can do
df.groupby(df.col1 == df.col2).apply(f)
and the result will be keyed by True/False.
you can also group by both columns at once and read the congruency off the group keys:
df.groupby(['col1', 'col2']).apply(lambda g: (g['col1'] == g['col2']).all())
you can also use a mask:
df[df.col1==df.col2]
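For completeness, a tiny self-contained example of the True/False grouping, using the column names from the question:

import pandas as pd

df = pd.DataFrame({'cuepos':    [1, 2, 1, 2],
                   'targetpos': [1, 1, 2, 2],
                   'response':  [0.51, 0.72, 0.64, 0.58]})

# the group keys are True (congruent) and False (incongruent)
congruent = df.groupby(df.cuepos == df.targetpos)
print(congruent['response'].mean())
# False    0.680
# True     0.545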
