How to make a crosstab in pandas from non-numeric data?

I need to create a crosstab from string data. In Excel, if you put string data into a pivot table, it is automatically turned into counts per the other factor. For instance, I have column 'A' containing application numbers and column 'B' containing dates, and I need to show how many applications were placed on each day. A classic crosstab returns an error.
data.columns = [['applicationnumber', 'date', 'param1', 'param2', 'param3']] #mostly string values
Examples of input data:
applicationnumber = "AAA12345678"
date = 'YYYY-MM-DD'

Is this what you are looking for?
import pandas as pd
import numpy as np

df = pd.DataFrame([['app1', '01/01/2019'],
                   ['app2', '01/02/2019'],
                   ['app3', '01/02/2019'],
                   ['app4', '01/02/2019'],
                   ['app5', '01/04/2019'],
                   ['app6', '01/04/2019']],
                  columns=['app.no', 'date'])
print(pd.pivot_table(df, values='app.no', index='date', aggfunc=np.size))
Output:
            app.no
date
01/01/2019       1
01/02/2019       3
01/04/2019       2
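As a side note (my own addition, not part of the answer above): since the question asked specifically about counting strings, the same per-day counts can be produced with a plain groupby, with no numeric column involved. A minimal sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame([['app1', '01/01/2019'],
                   ['app2', '01/02/2019'],
                   ['app3', '01/02/2019'],
                   ['app4', '01/02/2019'],
                   ['app5', '01/04/2019'],
                   ['app6', '01/04/2019']],
                  columns=['app.no', 'date'])

# Count how many application numbers fall on each date
counts = df.groupby('date')['app.no'].count()
print(counts)
```

`df['date'].value_counts().sort_index()` gives the same result in one call.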

Related

Pycharm problem set (Stuck from step 3 onwards)

Using the ff_monthly.csv data set https://github.com/alexpetralia/fama_french:
Use the first column as an index (this contains the year and month of the data as a string).
Create a new column 'Mkt' as 'Mkt-RF' + 'RF'.
Create two new columns in the loaded DataFrame, 'Month' and 'Year', to contain the year and month of the dataset extracted from the index column.
Create a new DataFrame with columns 'Mean' and 'Standard Deviation' and the full set of years from (b) above.
Write a function which accepts (r_m, s_m), the monthly mean and standard deviation of a return series, and returns a tuple (r_a, s_a), the annualised mean and standard deviation. Use the formulae: r_a = (1+r_m)^12 - 1, and s_a = s_m * 12^0.5.
Loop through each year in the data, and calculate the annualised mean and standard deviation of the new 'Mkt' column, storing each in the newly created DataFrame. Note that the values in the input file are % returns, and need to be divided by 100 to return decimals (i.e. the value for August 2022 represents a return of -3.78%).
Print the DataFrame and output it to a csv file.
Workings so far:
import pandas as pd

ff_monthly = pd.read_csv(r"file path", index_col=0)
Mkt = ff_monthly['Mkt-RF'] + ff_monthly['RF']
ff_monthly = ff_monthly.assign(Mkt=Mkt)
df = pd.DataFrame(ff_monthly)
There are a few things to pay attention to.
The Date is the index of your DataFrame. The index is treated specially compared to normal columns, which is why df.Date raises an AttributeError: Date is not an attribute, but the index. Try df.index instead.
df.Date.str.split("_", expand=True) would work if your Date looked like 22_10. However, according to your picture it doesn't contain an underscore and also contains the day, so this cannot work.
In fact, the format you have doesn't follow any standard. The best way to deal with it is to parse it into a proper datetime64[ns] type that pandas understands, with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the Python documentation for supported format strings.
If all this works, it should be rather straightforward to create the columns:
df['year'] = df.index.year  # a DatetimeIndex exposes .year directly; no .dt accessor needed
In fact, this part has been asked before

How to relocate different data that is in a single column to their respective columns?

I have a dataframe whose data are strings and different information are mixed in a single column, like this:
0    Place: House
1    Date/Time: 01/02/03 at 09:30
2    Color:Yellow
3    Place: Street
4    Date/Time: 12/12/13 at 13:21:21
5    Color:Red
df = pd.DataFrame(['Place: House','Date/Time: 01/02/03 at 09:30', 'Color:Yellow', 'Place: Street','Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])
I need the dataframe like this:
    Place  Date/Time  Color
   House    01/02/03  Yellow
  Street    21/12/13  Red
I started by converting the excel file to csv, and then I tried to open it as follows:
df = pd.read_csv(filename, sep=":")
I tried using the ":" to separate the columns, but the time formatting also uses ":", so it didn't work. The time is not important information so I even tried to delete it and keep the date, but I couldn't find a way that wouldn't affect the other information in the column either.
Given the values in your data, you will need to limit the split to happen just once, which you can do with the n parameter of split. You can expand the split values into two columns, then pivot.
The trick here is to create a grouping by taking df.index // 3 as the index, so that every 3 lines land in a new group.
import pandas as pd

df = pd.DataFrame(['Place: House', 'Date/Time: 01/02/03 at 09:30', 'Color:Yellow',
                   'Place: Street', 'Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])
df = df[0].str.split(':', n=1, expand=True)
df['idx'] = df.index // 3
df.pivot(index='idx', columns=0, values=1).reset_index().drop(columns='idx')[['Place', 'Date/Time', 'Color']]
Output
0 Place Date/Time Color
0 House 01/02/03 at 09:30 Yellow
1 Street 21/12/13 at 13:21:21 Red
Your data is all strings; IMO you are likely to get better performance wrangling it in vanilla Python before bringing it back into pandas. The only time strings are likely to be faster in pandas is if you are using the pyarrow string data type.
from collections import defaultdict
import pandas as pd

out = df.squeeze().tolist()  # this works since it is just one column
frame = defaultdict(list)
for entry in out:
    key, value = entry.split(':', maxsplit=1)
    if key == "Date/Time":
        value = value.split('at')[0]
    value = value.strip()
    key = key.strip()  # not really necessary
    frame[key].append(value)
pd.DataFrame(frame)
Place Date/Time Color
0 House 01/02/03 Yellow
1 Street 21/12/13 Red
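A third variant (my own, not from the answers above): a single `str.extract` with named groups replaces the split, and the same index-division trick then pivots the records. The group names `key` and `value` are illustrative choices:

```python
import pandas as pd

df = pd.DataFrame(['Place: House', 'Date/Time: 01/02/03 at 09:30', 'Color:Yellow',
                   'Place: Street', 'Date/Time: 21/12/13 at 13:21:21', 'Color:Red'])

# Capture everything before the first colon as the key, the rest as the value
parts = df[0].str.extract(r'(?P<key>[^:]+):\s*(?P<value>.+)')
parts['row'] = parts.index // 3  # every 3 lines form one record
wide = parts.pivot(index='row', columns='key', values='value')
print(wide[['Place', 'Date/Time', 'Color']])
```

The `\s*` in the pattern also absorbs the inconsistent space after the colon, so no stripping is needed afterwards.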

filtering pandas dataframe when data contains two parts

I have a pandas dataframe and want to filter down to all the rows that contain a certain criteria in the “Title” column.
The rows I want to filter down to are all rows that contain the format “(Axx)” (Where xx are 2 numbers).
The data in the “Title” column doesn’t just consist of “(Axx)” data.
The data in the “Title” column looks like so:
“some_string (Axx)”
I've been playing around with different methods but can't seem to get it.
I think the closest I've gotten is:
df.filter(regex=r'(D\d{2})', axis=0)
but it's not correct, as the entries aren't being filtered.
Use Series.str.contains with escaped parentheses and $ for end of string, and filter with boolean indexing:
df = pd.DataFrame({'Title': ['(D89)', 'aaa (D71)', '(D5)', '(D78) aa', 'D72']})
print(df)
       Title
0      (D89)
1  aaa (D71)
2       (D5)
3   (D78) aa
4        D72

df1 = df[df['Title'].str.contains(r'\(D\d{2}\)$')]
print(df1)
       Title
0      (D89)
1  aaa (D71)
If you need to match only (Dxx), i.e. the whole string, use Series.str.match:
df2 = df[df['Title'].str.match(r'\(D\d{2}\)$')]
print(df2)
   Title
0  (D89)
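If you also want the matched code in its own column, `Series.str.extract` returns the captured group. A sketch on the same frame (note there is no `$` anchor here, so it also matches '(D78) aa'):

```python
import pandas as pd

df = pd.DataFrame({'Title': ['(D89)', 'aaa (D71)', '(D5)', '(D78) aa', 'D72']})

# NaN wherever the (Dxx) pattern does not occur in the string
df['code'] = df['Title'].str.extract(r'\((D\d{2})\)', expand=False)
print(df)
```

With `expand=False` the single capture group comes back as a Series, ready to assign as a column.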

Better way to convert pandas column containing numbers stored as text to numbers

I've got an Excel file where the material group column contains both numbers (e.g. 1120) and strings (8120M). In a different report (which is handled by a different team and which I can't edit), the same column is string only (numbers stored as text). In order to use pd.merge() or any Excel functions, I have to convert the numbers in that file to numbers. Most of the merges are done based on that column, so there's no workaround (the rest are based on PO and vendor number, fortunately).
This works, but seems really ham-fisted. The report itself is around ~15,000 lines monthly, so even the YTD rolling report won't go over 200k lines. Still, if there's a more elegant solution I'd like to know as my data won't always be this small.
raw["Matl Group"] = (
    raw["Matl Group"]
    .apply(pd.to_numeric, errors="coerce")
    .fillna(raw["Matl Group"])
)
You can use pd.to_numeric without apply:
import pandas as pd

df = pd.DataFrame({'Matl Group': ['1120', '8120M']})
df['Matl Group'] = pd.to_numeric(df['Matl Group'], errors='coerce') \
    .fillna(df['Matl Group'])
print(df)
# Output:
  Matl Group
0     1120.0
1      8120M
Alternatively, you can use pd.factorize to create a numeric value of Matl Group:
df['Matl Group Numeric'] = pd.factorize(df['Matl Group'])[0]
print(df)
# Output:
Matl Group Matl Group Numeric
0 1120 0
1 8120M 1
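Since the column only has to match for the merge, another option (my suggestion, not from the answers above) is to normalise the key to strings on both sides instead of to numbers. A sketch with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'Matl Group': [1120, '8120M'], 'qty': [5, 7]})
right = pd.DataFrame({'Matl Group': ['1120', '8120M'], 'desc': ['steel', 'alloy']})

# Cast the merge key to str on both sides so 1120 and '1120' line up
left['Matl Group'] = left['Matl Group'].astype(str)
right['Matl Group'] = right['Matl Group'].astype(str)
merged = left.merge(right, on='Matl Group', how='left')
print(merged)
```

This avoids the float artefacts (1120 becoming 1120.0) that pd.to_numeric with coerce introduces.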

Python Pandas Splitting Strings and Storing the Remainder in New Row

I have a pandas dataframe where observations are broken out per every two days. The values in the 'Date' column each describe a range of two days (eg 2020-02-22 to 2020-02-23).
I want to split those Date values into individual days, with a row for each day. The closest I got was by doing newdf = df_day.set_index(df_day.columns.drop('Date',1).tolist()).Date.str.split(' to ', expand=True).stack().reset_index().loc[:, df_day.columns]
The problem here is that the new date values are returned as NaNs. Is there a way to get this data broken out by individual day?
I might not be understanding, but based on the image it's a single date per row as is, just poorly labeled -- I would manipulate the index strings, and if I can't do that I would create a new date column, or new df w/ clean date and merge it.
You should be able to chop off the first 14 characters with a lambda, leaving you with the second listed date in the index.
I can't reproduce this, so bear with me.
df.rename(index=lambda s: s[14:])
#should remove first 14 characters from each row label.
#leaving just '2020-02-23' in row 2.
#If you must skip row 1, idx = df.index[1:]
#or df.iloc[1:].rename(index=lambda s: s[1:])
Otherwise, I would just replace it with a new datetime index.
didx = pd.date_range(start='2000-01-10', end='2020-02-26', freq='D')
# Make sure it is the same length as df
df = df.set_index(didx)
# Or
# df['new_date'] = didx.values
# df = df.set_index('new_date').drop(columns=['Date'])
# Or
# df = pd.concat([df.reset_index(drop=True), pd.Series(didx, name='new_date')], axis=1)
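For the actual goal of one row per day, `str.split` plus `DataFrame.explode` (available since pandas 0.25) handles the 'A to B' ranges directly. A sketch with a made-up frame mirroring the question's format:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2020-02-22 to 2020-02-23',
                            '2020-02-24 to 2020-02-25'],
                   'value': [10, 20]})

# Turn each range string into a list of dates, then one row per list element
out = (df.assign(Date=df['Date'].str.split(' to '))
         .explode('Date')
         .reset_index(drop=True))
print(out)
```

explode repeats the other columns for each date, so no NaNs appear, unlike the stack-based attempt in the question.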
