In Python how can I format a pandas dataframe and crosstab?


This is my code:
import pandas as pd
cols= ['DD','MM','YYYY','HH'] #names
DD,MM,YYYY,HH=[1,2,None,4,5,5],[1,1,1,2,2,3],[2014,2014,2014,2014,2014,2014],[20,20,20,18,18,18] #data
df = pd.DataFrame(list(zip(DD,MM,YYYY,HH)), columns =cols )
print (df)
a=pd.crosstab(df .HH, df .MM,margins=True)
print (a)
I would like to view the results in a table format. Table borders, or at least columns with the same number of digits, would solve the problem.
I want to see the table on the console, without any graphical window.

If you want a nice-looking crosstab, you can use seaborn.heatmap.
An example:
>>> import numpy as np; np.random.seed(0)
>>> import seaborn as sns; sns.set()
>>> uniform_data = np.random.rand(10, 12)
>>> ax = sns.heatmap(uniform_data)
The result would look like this:
You can find many examples that show how to apply this, e.g.:
https://www.science-emergence.com/Codes/How-to-plot-a-confusion-matrix-with-matplotlib-and-seaborn/
https://gist.github.com/shaypal5/94c53d765083101efc0240d776a23823
Update
To simply display the crosstab in a formatted way in a Jupyter/IPython notebook, you can skip the print and use display like this:
import pandas as pd
cols= ['DD','MM','YYYY','HH'] #names
DD,MM,YYYY,HH=[1,2,None,4,5,5],[1,1,1,2,2,3],[2014,2014,2014,2014,2014,2014],[20,20,20,18,18,18] #data
df = pd.DataFrame(list(zip(DD,MM,YYYY,HH)), columns =cols )
print (df)
a = pd.crosstab(df .HH, df .MM,margins=True)
display(a)
which yields the same result as evaluating the expression as the last line of a notebook cell:
pd.crosstab(df.HH, df.MM, margins=True)
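Since the question asks for a bordered table on the plain console, here is a minimal sketch of one way to do that with pandas alone. The bordered function is a hypothetical helper written for this example, not part of pandas; it only relies on reset_index, astype and string padding.

```python
import pandas as pd

# Same data as in the question.
cols = ['DD', 'MM', 'YYYY', 'HH']
DD, MM, YYYY, HH = [1, 2, None, 4, 5, 5], [1, 1, 1, 2, 2, 3], [2014] * 6, [20, 20, 20, 18, 18, 18]
df = pd.DataFrame(list(zip(DD, MM, YYYY, HH)), columns=cols)
a = pd.crosstab(df.HH, df.MM, margins=True)


def bordered(frame):
    """Hypothetical helper (not a pandas function): render a frame with ASCII borders."""
    # Move the index into a regular column and turn every cell into text.
    table = frame.reset_index().astype(str)
    header = [str(c) for c in table.columns]
    rows = table.values.tolist()
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(cell) for cell in column) for column in zip(header, *rows)]
    rule = '+' + '+'.join('-' * (w + 2) for w in widths) + '+'

    def fmt(cells):
        # Right-align every cell so the digits line up.
        return '| ' + ' | '.join(c.rjust(w) for c, w in zip(cells, widths)) + ' |'

    return '\n'.join([rule, fmt(header), rule] + [fmt(r) for r in rows] + [rule])


text = bordered(a)
print(text)
```

If borders are not essential, print(a.to_string()) already gives right-aligned columns on the console.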

Related

Pandas groupby using only year and month

I have a Python program using Pandas, which reads two dataframes, obtained in the following links:
Casos-positivos-diarios-en-San-Nicolas-de-los-Garza-Promedio-movil-de-7-dias: https://datamexico.org/es/profile/geo/san-nicolas-de-los-garza#covid19-evolucion
Denuncias-segun-bien-afectado-en-San-Nicolas-de-los-GarzaClic-en-el-grafico-para-seleccionar: https://datamexico.org/es/profile/geo/san-nicolas-de-los-garza#seguridad-publica-denuncias
What I currently want to do is a groupby on the "covid" dataframe so that rows with the same date are summed. However, no method has worked so far; each attempt prints an error indicating that I should be using the syntax for "PeriodIndex". Does anyone have a suggestion or solution? Thanks in advance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
#csv for the covid cases
covid = pd.read_csv('Casos-positivos-diarios-en-San-Nicolas-de-los-Garza-Promedio-movil-de-7-dias.csv')
#csv for complaints
comp = pd.read_csv('Denuncias-segun-bien-afectado-en-San-Nicolas-de-los-GarzaClic-en-el-grafico-para-seleccionar.csv')
#cleaning data in both dataframes
#keeping only the relevant columns
covid = covid[['Month','Daily Cases']]
comp = comp[['Month','Affected Legal Good', 'Value']]
#changing the labels from spanish to english
comp['Affected Legal Good'].replace({'Patrimonio': 'Heritage', 'Familia':'Family', 'Libertad y Seguridad Sexual':'Sexual Freedom and Safety', 'Sociedad':'Society', 'Vida e Integridad Corporal':'Life and Bodily Integrity', 'Libertad Personal':'Personal Freedom', 'Otros Bienes Jurídicos Afectados (Del Fuero Común)':'Other Affected Legal Assets (Common Jurisdiction)'}, inplace=True, regex=True)
#changing the month types to dates
covid['Month'] = pd.to_datetime(covid['Month'])
covid['Month'] = covid['Month'].dt.to_period('M')
covid

You can simply use a groupby statement. TimeGrouper by default converts it to datetime.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
#csv for the covid cases
covid = pd.read_csv('Casos-positivos-diarios-en-San-Nicolas-de-los-Garza-Promedio-movil-de-7-dias.csv')
covid = covid.groupby(['Month'])['Daily Cases'].sum()
covid = covid.reset_index()
# #changing the month types to dates
covid['Month'] = pd.to_datetime(covid['Month'])
covid['Month'] = covid['Month'].dt.to_period('M')
covid
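As a small self-contained illustration of summing per month, here is a sketch that converts the dates to monthly periods first and groups afterwards. The dates and case counts are made-up stand-ins for the CSV, which is not reproduced here; only the column names 'Month' and 'Daily Cases' come from the question.

```python
import pandas as pd

# Hypothetical sample standing in for the covid CSV from the question.
covid = pd.DataFrame({
    'Month': ['2021-01-05', '2021-01-20', '2021-02-03', '2021-02-17', '2021-03-01'],
    'Daily Cases': [10, 20, 5, 15, 7],
})

# Reduce every date to its year-month period, then sum the rows that share a period.
covid['Month'] = pd.to_datetime(covid['Month']).dt.to_period('M')
monthly = covid.groupby('Month', as_index=False)['Daily Cases'].sum()
print(monthly)
```

Grouping on the period column keeps one row per year and month, so January 2021 and January 2022 would stay separate.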

Adding tooltip to folium.features.GeoJson from a geopandas dataframe

I am having issues adding tooltips to my folium.features.GeoJson. I can't get columns to display from the dataframe when I select them.
feature = folium.features.GeoJson(df.geometry,
name='Location',
style_function=style_function,
tooltip=folium.GeoJsonTooltip(fields= [df.acquired],aliases=["Time"],labels=True))
ax.add_child(feature)
For some reason when I run the code above it responds with
Name: acquired, Length: 100, dtype: object is not available in the data. Choose from: ().
I can't seem to link the data to my tooltip.

I have made your code an MWE by including some data.
There are two key issues with your code:
You need to pass the properties, not just the geometry, to folium.features.GeoJson(). Hence I passed df instead of df.geometry.
folium.GeoJsonTooltip() takes a list of property (column) names, not an array of values. Hence pass a column name such as ["acquired"] (the code below uses the string column ["tt"]) instead of the array of values from a dataframe column.
There is also an implied issue with your code: all dataframe columns need to contain values that can be serialised to JSON. Hence the conversion of acquired to a string and the drop().
import geopandas as gpd
import pandas as pd
import shapely.wkt
import io
import folium
df = pd.read_csv(io.StringIO("""ref;lanes;highway;maxspeed;length;name;geometry
A3015;2;primary;40 mph;40.68;Rydon Lane;MULTILINESTRING ((-3.4851169 50.70864409999999, -3.4849879 50.7090007), (-3.4857269 50.70693379999999, -3.4853034 50.7081574), (-3.488620899999999 50.70365289999999, -3.4857269 50.70693379999999), (-3.4853034 50.7081574, -3.4851434 50.70856839999999), (-3.4851434 50.70856839999999, -3.4851169 50.70864409999999))
A379;3;primary;50 mph;177.963;Rydon Lane;MULTILINESTRING ((-3.4763853 50.70886769999999, -3.4786112 50.70811229999999), (-3.4746017 50.70944449999999, -3.4763853 50.70886769999999), (-3.470350900000001 50.71041779999999, -3.471219399999999 50.71028909999998), (-3.465049699999999 50.712158, -3.470350900000001 50.71041779999999), (-3.481215600000001 50.70762499999999, -3.4813909 50.70760109999999), (-3.4934747 50.70059599999998, -3.4930204 50.7007898), (-3.4930204 50.7007898, -3.4930048 50.7008015), (-3.4930048 50.7008015, -3.4919513 50.70168349999999), (-3.4919513 50.70168349999999, -3.49137 50.70213669999998), (-3.49137 50.70213669999998, -3.4911565 50.7023015), (-3.4911565 50.7023015, -3.4909108 50.70246919999999), (-3.4909108 50.70246919999999, -3.4902349 50.70291189999999), (-3.4902349 50.70291189999999, -3.4897693 50.70314579999999), (-3.4805021 50.7077218, -3.4806265 50.70770150000001), (-3.488620899999999 50.70365289999999, -3.4888806 50.70353719999999), (-3.4897693 50.70314579999999, -3.489176800000001 50.70340539999999), (-3.489176800000001 50.70340539999999, -3.4888806 50.70353719999999), (-3.4865751 50.70487679999999, -3.4882604 50.70375799999999), (-3.479841700000001 50.70784459999999, -3.4805021 50.7077218), (-3.4882604 50.70375799999999, -3.488620899999999 50.70365289999999), (-3.4806265 50.70770150000001, -3.481215600000001 50.70762499999999), (-3.4717096 50.71021009999998, -3.4746017 50.70944449999999), (-3.4786112 50.70811229999999, -3.479841700000001 50.70784459999999), (-3.471219399999999 50.71028909999998, -3.4717096 50.71021009999998))"""),
sep=";")
df = gpd.GeoDataFrame(df, geometry=df["geometry"].apply(shapely.wkt.loads), crs="epsg:4326")
df["acquired"] = pd.date_range("8-feb-2022", freq="1H", periods=len(df))
def style_function(x):
    return {"color":"blue", "weight":3}
ax = folium.Map(
location=[sum(df.total_bounds[[1, 3]]) / 2, sum(df.total_bounds[[0, 2]]) / 2],
zoom_start=12,
)
# datetime is not JSON serializable...
df["tt"] = df["acquired"].dt.strftime("%Y-%b-%d %H:%M")
feature = folium.features.GeoJson(df.drop(columns="acquired"),
name='Location',
style_function=style_function,
tooltip=folium.GeoJsonTooltip(fields= ["tt"],aliases=["Time"],labels=True))
ax.add_child(feature)
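The reason for converting acquired to a string can be checked without folium. The sketch below uses only pandas and the standard json module; the dictionary keys are just illustrative and the timestamp is an arbitrary example value.

```python
import json

import pandas as pd

stamp = pd.Timestamp('2022-02-08 01:00')

# json.dumps has no encoder for datetime-like objects, so this raises TypeError.
try:
    json.dumps({'acquired': stamp})
    serialisable = True
except TypeError:
    serialisable = False

# Formatting the timestamp as text first makes the value JSON-friendly.
as_text = stamp.strftime('%Y-%b-%d %H:%M')
payload = json.dumps({'tt': as_text})
print(serialisable, payload)
```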

How to apply a style to a Python DataFrame for all rows except the last one?

I am trying to apply a Bar Style to all of the data in the dataframe, except the last row, which is supposed to be the Total row.
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
data.loc['Total'] = data.sum()
              A         B
0     -1.224620 -0.373898
1       0.75568  0.997875
2     -1.284663 -0.211903
3     -0.274813 -0.871816
4      1.256267 -0.742521
Total -0.772143 -1.202263

The docs explain, for the subset argument, that
A tuple is treated as (row_indexer, column_indexer)
So you just need to tweak the subset option so that it selects every row except Total.
On your data:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
data.loc['Total'] = data.sum()
data.style.bar(subset = ([0,1,2,3,4], ['A', 'B']))
This draws the bars on rows 0 to 4 and leaves the Total row unstyled.

How to plot data based on given time?

I have a dataset like the one shown below.
Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000
I've used pandas to get the data into a DataFrame. The dataset has data for multiple days with an interval of 1 min for each row in the dataset.
I want to plot separate graphs for the voltage with respect to the time(shown in column 2) for each day(shown in column 1) using python. How can I do that?

txt = '''Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000'''
import pandas as pd
import matplotlib.pyplot as plt
from io import StringIO
f = StringIO(txt)
df = pd.read_table(f, sep=';')
plt.plot(df['Time'], df['Voltage'])
plt.show()
This gives the following output:

I believe this will do the trick (I edited the dates so we have two dates)
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter Notebook only:
%matplotlib inline
df = pd.read_csv('test.csv', sep=';', usecols=['Date','Time','Voltage'])
unique_dates = df.Date.unique()
for date in unique_dates:
    print('Date: ' + date)
    # one separate figure per day
    df.loc[df.Date == date].plot.line('Time', 'Voltage')
    plt.show()
You will get this:

X = df.Date.unique()
for i in X: #iterate over unique days
    temp_df = df[df.Date==i] #get df for specific day
    temp_df.plot(x = 'Time', y = 'Voltage') #plot
If you want numeric x values instead of the time strings, you can use
x = np.arange(1, len(temp_df.Time) + 1)  # one value per row of temp_df
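The per-day loop can also be written with groupby, drawing one subplot per day in a single figure. This is a sketch under a few assumptions: the sample is shortened to three columns, the rows dated 17/12/2006 are invented so that there are two days to plot, and the Agg backend is selected only so the code runs without a display.

```python
import io

import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Shortened sample; the 17/12/2006 rows are made up to get a second day.
txt = """Date;Time;Voltage
16/12/2006;17:24:00;234.840
16/12/2006;17:25:00;233.630
17/12/2006;17:24:00;235.680
17/12/2006;17:25:00;235.020
"""
df = pd.read_csv(io.StringIO(txt), sep=';')

# Combine both columns into real timestamps; the dates are day-first.
df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H:%M:%S')

# One (day, rows) pair per date, in order of appearance.
days = list(df.groupby('Date', sort=False))
fig, axes = plt.subplots(len(days), 1, squeeze=False)
for ax, (day, group) in zip(axes[:, 0], days):
    ax.plot(group['DateTime'], group['Voltage'])
    ax.set_title(day)
    ax.set_ylabel('Voltage')
fig.tight_layout()
```

Calling plt.show() afterwards displays the figure when an interactive backend is in use.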

Group by hour and minute after creating a DateTime column to handle multiple days. You can filter for a specific day before grouping.
txt = '''Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000'''
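import pandas as pd
import matplotlib.pyplot as plt
from io import StringIO
f = StringIO(txt)
df = pd.read_table(f, sep=';')
# dates are day-first (16/12/2006), so state the format explicitly
df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H:%M:%S')
df.set_index('DateTime', inplace=True)
# select one day; take the grouping keys from the filtered frame so the lengths match
day = df[df['Date'] == '16/12/2006']
grouped = day.groupby([day.index.hour, day.index.minute])['Voltage'].mean()
grouped.plot()
plt.show()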

Why can't I search for a row in a pandas df using a date as part of a tuple index?

I am trying to search a pandas df I made which has a tuple as an index. The first part of the tuple is a date and the second part is a forex pair. I've tried a few things but I can't seem to search using a date-formatted string as part of a tuple with .loc or .ix
My df looks like this:
                        Open   Close
(11-01-2018, AEDAUD)  0.3470  0.3448
(11-01-2018, AEDCAD)  0.3415  0.3408
(11-01-2018, AEDCHF)  0.2663  0.2656
(11-01-2018, AEDDKK)  1.6955  1.6838
(11-01-2018, AEDEUR)  0.2277  0.2261
Here is the complete code :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
forex_11 = pd.read_csv('FOREX_20180111.csv', sep=',', parse_dates=['Date'])
forex_12 = pd.read_csv('FOREX_20180112.csv', sep=',', parse_dates=['Date'])
time_format = '%d-%m-%Y'
forex = forex_11.append(forex_12, ignore_index=False)
forex['Date'] = forex['Date'].dt.strftime(time_format)
GBP = forex[forex['Symbol'] == "GBPUSD"]
forex.index = list(forex[['Date', 'Symbol']].itertuples(index=False, name=None))
forex_open_close = pd.DataFrame(np.array(forex[['Open','Close']]), index=forex.index)
forex_open_close.columns = ['Open', 'Close']
print(forex_open_close.head())
print(forex_open_close.ix[('11-01-2018', 'GBPUSD')])
How do I get the row which has index ('11-01-2018', 'GBPUSD') ?

Can you try putting the tuple in a list using brackets? Note that .ix was removed in pandas 1.0, so use .loc.
Like this:
print(forex_open_close.loc[[('11-01-2018', 'GBPUSD')]])

I would recommend using a pandas MultiIndex. In your case you could do the following:
tuples = list(data[['Date', 'Symbol']].itertuples(index=False, name=None))
data.index = pd.MultiIndex.from_tuples(tuples, names=['Date', 'Symbol'])
# And then to index (the date key must match the values stored in the Date level)
data.loc['2018-01-11', 'AEDCAD']
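A self-contained sketch of the MultiIndex lookup, using three of the rows printed in the question. It assumes the dates are kept as the strings shown there, and it uses set_index as a shorter way to build the same kind of index.

```python
import pandas as pd

# Three rows copied from the table in the question; dates kept as strings.
forex = pd.DataFrame({
    'Date': ['11-01-2018', '11-01-2018', '11-01-2018'],
    'Symbol': ['AEDAUD', 'AEDCAD', 'AEDCHF'],
    'Open': [0.3470, 0.3415, 0.2663],
    'Close': [0.3448, 0.3408, 0.2656],
})

# set_index with two columns builds a MultiIndex with levels Date and Symbol.
indexed = forex.set_index(['Date', 'Symbol'])

# The explicit (row_key, column_slice) form avoids ambiguity between rows and columns.
row = indexed.loc[('11-01-2018', 'AEDCAD'), :]

# All symbols for one date.
one_day = indexed.xs('11-01-2018', level='Date')
print(row)
print(one_day)
```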
