Serialized Array to Columns in Pandas DataFrame - python

I have imported a .csv file and it contains a column with a serialized array in it.
How can I make 4 columns out of the array? I already tried some things with regex and the phpserialize package but I could not get it done.
This is how the column looks:
forecast
---------------------------------------------------------------------------
a:4:{s:5:"sunny";i:10;s:5:"rainy";i:70;s:8:"thundery";i:0;s:5:"snowy";i:20;}
Now I want the whole column separated into 4 columns like this:
sunny | rainy | thundery | snowy
--------------------------------
10    | 70    | 0        | 20
Is there an easy way to do this? Thanks in advance!

If your forecasts are saved as strings in your dataframe, then you can extract your desired values with a regex, then pivot the dataframe. Something like this should help get you started (I've added in a row with new values just to demonstrate):
>>> df
forecast
0 'a:4:{s:5:"sunny";i:10;s:5:"rainy";i:70;s:8:"t...'
1 'a:4:{s:5:"sunny";i:20;s:5:"rainy";i:80;s:8:"t...'
df.forecast.str.extractall(r'"(?P<column>.*?)";i:(?P<value>\d+)').reset_index(level=0).pivot(index='level_0', columns='column', values='value')
column rainy snowy sunny thundery
level_0
0 70 20 10 0
1 80 10 20 5
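
Since the question mentions the phpserialize package, an alternative sketch (assuming that package is installed and every cell holds exactly one serialized PHP array) parses each cell into a dict and expands it into columns:

import pandas as pd
import phpserialize  # pip install phpserialize

df = pd.DataFrame({'forecast': [
    'a:4:{s:5:"sunny";i:10;s:5:"rainy";i:70;s:8:"thundery";i:0;s:5:"snowy";i:20;}',
]})

def parse_forecast(cell):
    # phpserialize works on bytes and returns byte keys, so encode/decode around it
    parsed = phpserialize.loads(cell.encode())
    return {key.decode(): value for key, value in parsed.items()}

expanded = df['forecast'].apply(parse_forecast).apply(pd.Series)
print(expanded)
#    sunny  rainy  thundery  snowy
# 0     10     70         0     20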

Related

How to modify code in Python so as to make calculations only on NOT NaN rows in Pandas?

I have a Pandas DataFrame in Python like below:
NR
--------
910517196
921122192
NaN
And by using the code below I try to calculate age based on column NR in the above DataFrame (it does not matter exactly how the code works, I know that it is correct - briefly, I take the first 6 characters to calculate the age, because for example 910517 is 1991-05-17 :)):
df["age"] = (ABT_DATE - pd.to_datetime(df.NR.str[:6], format = '%y%m%d')) / np.timedelta64(1, 'Y')
My problem is: I need to modify the above code to calculate the age using only the non-NaN values in column "NR", because some values are NaN.
My question is: How can I modify my code so that only the rows where column "NR" is not NaN are used in the calculation?
As a result I need something like below; simply put, I need to disregard the NaN rows and, where there is a NaN in column NR, insert a NaN in the calculated age column as well:
NR age
------------------
910517196 | 30
921122192 | 29
NaN | NaN
How can I do that in Python Pandas ?
df['age']=np.where(df['NR'].notnull(),'your_calculation',np.nan)
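A minimal sketch of how that could fit the question's own formula (assuming an ABT_DATE reference date; errors='coerce' turns the NaN entries into NaT so nothing raises):

import numpy as np
import pandas as pd

ABT_DATE = pd.Timestamp('2021-06-30')  # assumed reference date for the age calculation
df = pd.DataFrame({'NR': ['910517196', '921122192', np.nan]})

# Parse the first 6 characters; NaN rows become NaT instead of raising
birth = pd.to_datetime(df['NR'].str[:6], format='%y%m%d', errors='coerce')

# Keep the calculated age only where NR is present, NaN elsewhere
df['age'] = np.where(df['NR'].notnull(),
                     (ABT_DATE - birth) / np.timedelta64(1, 'Y'),
                     np.nan)
print(df)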

Python: Excel to data frame: removing top rows and columns that don't contain the 'right' data

I have a rather basic Excel to Pandas issue which I am unable to get around. Any help in this regard will be highly appreciated.
Source file
I got some data in an Excel file like below (apologies for pasting a picture and not a table):
Columns A, B, C are not required. I need the highlighted data to be read/moved into a pandas dataframe.
df = pd.read_excel('Miscel.xlsx',sheet_name='Sheet2',skiprows=8, usecols=[3,4,5,6])
df
Date Customers Location Sales
0 2021-10-05 A NSW 12
1 2021-10-03 B NSW 10
2 2021-10-01 C NSW 33
If your data is small, you can also read it all in and then drop the columns that are entirely NaN:
df = pd.read_excel('Miscel.xlsx',sheet_name='Sheet2',skiprows=8).dropna(how='all',axis=1)

Filtering, transposing and concatenating with Pandas

I'm trying something I've never done before and I'm in need of some help.
Basically, I need to filter sections of a pandas dataframe, transpose each filtered section and then concatenate every resulting section together.
Here's a representation of my dataframe:
df:
id | text_field | text_value
1  | Date       | 2021-06-23
1  | Hour       | 10:50
2  | Position   | City
2  | Position   | Countryside
3  | Date       | 2021-06-22
3  | Hour       | 10:45
I can then use some filtering method to isolate parts of my data:
df.groupby('id').filter(lambda x: True)
test = df.query(' id == 1 ')
test = test[["text_field","text_value"]]
test_t = test.set_index("text_field").T
test_t:
text_field | Date | Hour
text_value | 2021-06-23 | 10:50
If I repeat the process looking for rows with id == 3 and then concatenate the result with test_t, I'll have the following:
text_field | Date | Hour
text_value | 2021-06-23 | 10:50
text_value | 2021-06-22 | 10:45
I'm aware that performing this with rows where id == 2 will give me other columns and that's alright too, it's what I want as well.
What I can't figure out is how to do this for every "id" in my dataframe. I wasn't able to create a function or for loop that works. Can somebody help me?
To summarize:
1 - I need to separate my dataframe into sections according to the values in the "id" column
2 - After that I need to remove the "id" column and transpose the result
3 - I need to concatenate every resulting dataframe into one big dataframe
You can use pivot_table:
df.pivot_table(
    index='id', columns='text_field', values='text_value', aggfunc='first')
Output:
text_field Date Hour Position
id
1 2021-06-23 10:50 NaN
2 NaN NaN City
3 2021-06-22 10:45 NaN
It's not exactly clear how you want to deal with repeating values though; it would be great to have some description of that (id=2 would make a good example).
Update: If you want to ignore the ids and simply concatenate all the values:
pd.DataFrame(df.groupby('text_field')['text_value'].apply(list).to_dict())
Output:
Date Hour Position
0 2021-06-23 10:50 City
1 2021-06-22 10:45 Countryside
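
If you'd rather keep the explicit per-id loop described in the question (filter, transpose, concatenate), a minimal sketch of that approach could look like this; it keeps the first value per text_field within each id, matching aggfunc='first' above:

import pandas as pd

df = pd.DataFrame({
    'id':         [1, 1, 2, 2, 3, 3],
    'text_field': ['Date', 'Hour', 'Position', 'Position', 'Date', 'Hour'],
    'text_value': ['2021-06-23', '10:50', 'City', 'Countryside', '2021-06-22', '10:45'],
})

pieces = []
for _, group in df.groupby('id'):
    # keep one value per text_field, then turn the rows into columns
    block = group.drop_duplicates('text_field').set_index('text_field')[['text_value']].T
    pieces.append(block)

result = pd.concat(pieces, ignore_index=True)
print(result)
#          Date   Hour Position
# 0  2021-06-23  10:50      NaN
# 1         NaN    NaN     City
# 2  2021-06-22  10:45      NaN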

Extracting discontinuous set of rows in pandas

I have a pandas dataframe containing 100 rows. It looks something like this
id_number Name Age Salary
00001 Alice 50 6.2234
00002 John 29 9.1
...
00098 Susan 36 11.58
00099 Remy 50 3.7
00100 Walter 50 5.52
From this dataframe, I want to extract the rows corresponding to individuals whose ID numbers do NOT lie between 11 and 20. I want rows 0 to 9, and 20 to 99.
df.iloc allows extracting a continuous set of rows, like 20 to 99, but not 0 to 9 and 20 to 99 in the same go.
I also tried df[(df['id_number'] >= 20) & (df['id_number'] < 10)] but that returns an empty dataframe.
Is there a straightforward way to do this, that does not require doing two separate extractions and their concatenation?
Dropping a sliced index is what you need; in this case we slice labels 11 through 20.
Data:
df = pd.Series(np.arange(1, 101))
df
Drop the slice:
df.drop(df.loc[11:20].index, inplace=True)
df
This seemed to work (suggested by #FarhoodET):
df[(df['id_number'] >= 20) | (df['id_number'] < 10)]
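An equivalent sketch using Series.between (assuming id_number is numeric; the zero-padded string IDs shown above would first need a cast):

# Keep every row whose id_number is NOT between 11 and 20 (inclusive)
mask = ~df['id_number'].astype(int).between(11, 20)
subset = df[mask]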

Change unwanted datetime formatted data values into numbers with dashes in Python

I have data that has been changed due to some Excel formatting issues. When a value contains a number with a dash, Excel automatically changes it into a date format.
For example 1-1 changed into 01-Jan, and 25-2 changes to 25-Feb in Excel.
But other values, like 1A and 1001, are intact. When I load the data into Spyder it actually changes format again, into a datetime type.
First the data looks like this in Excel:
Name ID Value
Hello 1A 22
Hi 01-Jan 20
What 02-Jan 12
Is 1001 10
Up 25-Mar 11
The data comes up as a Pandas DataFrame with the current year (2019) inserted, using this Python code:
import pandas as pd
FAC_sheet = pd.read_excel('data', dtype=str)
Name ID Value
Hello 1A 22
Hi 2019-01-01 00:00:00 20
What 2019-01-02 00:00:00 12
Is 1001 10
Up 2019-03-25 00:00:00 11
Is there a way I can change only the strangely date-formatted values and keep the rest intact? The desired output is:
Name ID Value
Hello 1A 22
Hi 1-1 20
What 1-2 12
Is 1001 10
Up 3-25 11
You can try the below to override the automatic date conversion in pandas (replace 'Date' with the name of the affected column, here 'ID'):
pandas.read_excel(xlsx, sheet, converters={'Date': str})
From the docs (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html):
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
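
Note that converters only prevents further conversion at read time; if the values already arrive as the '2019-01-01 00:00:00' strings shown above, a post-processing sketch (assuming that exact string format and the 'ID' column name) could rewrite them back to month-day form and leave everything else untouched:

import pandas as pd

FAC_sheet = pd.read_excel('data', dtype=str)

def undo_excel_date(value):
    # Rewrite only values that parse as the unwanted datetime string; leave 1A, 1001, etc. alone
    try:
        ts = pd.to_datetime(value, format='%Y-%m-%d %H:%M:%S')
        return f"{ts.month}-{ts.day}"
    except (ValueError, TypeError):
        return value

FAC_sheet['ID'] = FAC_sheet['ID'].map(undo_excel_date)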
