Python Pandas .str.extract method fails when indexing - python

I'd like to set values on a slice of a DataFrame using .loc using pandas str extract method .str.extract() however, it's not working due to indexing errors. This code works perfectly if I swap extract with contains.
Here is a sample frame:
import pandas as pd
df = pd.DataFrame(
{
'name': [
'JUNK-0003426', 'TEST-0003435', 'JUNK-0003432', 'TEST-0003433', 'TEST-0003436',
],
'value': [
'Junk', 'None', 'Junk', 'None', 'None',
]
}
)
Here is my code:
df.loc[df["name"].str.startswith("TEST"), "value"] = df["name"].str.extract(r"TEST-\d{3}(\d+)")
How can I set the None values to the extracted regex string

Hmm the problem seems to be that .str.extract returns a pd.DataFrame, you can .squeeze it to turn it into a series and it seems to work fine:
df.loc[df["name"].str.startswith("TEST"), "value"] = df["name"].str.extract(r"TEST-\d{3}(\d+)").squeeze()
indexing alignment takes care of the rest.

Instead of trying to get the group, you can replace the rest with the empty string:
df.loc[df['value']=='None', 'value'] = df.loc[df['value']=='None', 'name'].str.replace('TEST-\d{3}', '')
Was this answer helpful to your problem?

Here is a way to do it:
df.loc[df["name"].str.startswith("TEST"), "value"] = df["name"].str.extract(r"TEST-\d{3}(\d+)").loc[:,0]
Output:
name value
0 JUNK-0003426 Junk
1 TEST-0003435 3435
2 JUNK-0003432 Junk
3 TEST-0003433 3433
4 TEST-0003436 3436

Related

Pandas Dataframe from list nested in json

I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
'__type': 'units',
'data': [{'name': 'units.unit', 'type': 'STRING'},
{'name': 'units.classification', 'type': 'STRING'}]},
{'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (Pandas) DataFrame that looks like this:
Things like pd.DataFrame(pd.json_normalize(test)['data'] are close but still throw the whole list into the column instead of making separate columns. record_path sounded right but I can't get it to work correctly either.
Any help?
It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
if d.get('__rowType', None)=='DATA' and 'data' in d],
columns=['unit', 'classification']
)
NB. assuming test the input list
output:
unit classification
0 A Energie
1 bar
2 CCM Volumen
3 CDM Volumen
Instead of just giving you the code, first I explain how you can do this by details and then I'll show you the exact steps to follow and the final code. This way you understand everything for any further situation.
When you want to create a pandas dataframe with two columns you can do this by creating a dictionary and passing it to DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
So if you want to have the dataframe you specified in your question the my_data dictionary should be like this:
my_data = {
'unit': ['A', 'bar', 'CCM', 'CDM'],
'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data, )
df.index = np.arange(1, len(df)+1)
df
(You can see the df.index=... part. This is because that the index column of the desired dataframe is started at 1 in your question)
So if you want to do so you just have to extract these data from the data you provided and convert them to the exact dictionary mentioned above (my_data dictionary)
To do so you can do this:
# This will get the data values like 'bar', 'CCM' and etc from your initial data
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be this:
d = YOUR_DATA
# This will get the data values like 'bar', 'CCM' and etc
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data, )
df.index = np.arange(1, len(df)+1)
df #or print(df)
Note: Of course you can do all of this in one complex line of code but to avoid confusion I decided to do this in couple of lines of code

Extract the country zip code based on the full country code - DataFrame Python

I have in my data information about place as full post code for example CZ25145. I would like to create new column for this with value CZ. How to do this?
I have this:
import pandas as pd
df = pd.DataFrame({
'CODE_LOAD_PLACE' : ['PL43100', 'CZ25905', 'DE29333', 'DE29384', 'SK92832']
},)
I would like to get it like below:
df = pd.DataFrame({
'CODE_LOAD_PLACE' : ['PL43100', 'CZ25905', 'DE29333', 'DE29384', 'SK92832'],
'COUNTRY_LOAD_PLACE' : ['PL', 'CZ', 'DE', 'DE', 'SK']
},)
I try use .factorize and .groupby but no positive final effect.
Use .str and select the first 2 characters:
df["COUNTRY_LOAD_PLACE"] = df["CODE_LOAD_PLACE"].str[:2]

Extract data from specific format in Pandas DF

I have a raw data in csv format which looks like this:
product-name brand-name rating
["Whole Wheat"] ["bb Royal"] ["4.1"]
Expected output:
product-name brand-name rating
Whole Wheat bb Royal 4.1
I want this to affect every entry in my dataset. I have 10,000 rows of data. How can I do this using pandas?
Can we do this using regular expressions? Not sure how to do it.
Thank you.
Edit 1:
My data looks something like this:
df = {
'product-name': [
[""'Whole Wheat'""], [""'Milk'""] ],
'brand-name': [
[""'bb Royal'""], [""'XYZ'""] ],
'rating': [
[""'4.1'""], [""'4.0'""] ]
}
df_p = pd.DataFrame(data=df)
It outputs like this: ["bb Royal"]
PS: Apologies for my programming. I am quite new to programming and also to this community. I really appreciate your help here :)
IIUC select first values of lists:
df = df.apply(lambda x: x.str[0])
Or if values are strings:
df = df.replace('[\[\]]', '', regex=True)
You can use the explode function
df = df.apply(pd.Series.explode)

Google Sheets API and Pandas. Inconsistent data length from the API

I'm using the google sheets API to get data which I then pass to Pandas so I can easily work with the data.
Let's say I want to get a sheet with the following data (depicted as a JSON object as tables weren't presented here well)
{
columns: ['Name', 'Age', 'Tlf.' 'Address'],
data: ['Julie', '35', '12345', '8 Leafy Street']
}
The sheets API will return something along the lines of this:
{
'range': 'Cases!A1:AE999',
'majorDimension': 'ROWS',
'values':
[
['Name', 'Age', 'Tlf.', 'Address'],
['Julie', '35', '12345', '8 Leafy Street']
]
}
This is great and allows me to easily pass the column headings and data to Pandas without much fuss. I do this in the following manner:
values = sheets_api_result["values"]
df = pd.DataFrame(values[1:], columns=values[0])
My Problem
If I have a Gsuite Sheet that looks like the below table, depicted as a key:value data type
{
columns: ['Name', 'Age', 'Tlf.' 'Address'],
data: ['Julie', '35', '', '']
}
I will receive the following response
{
'range': 'Cases!A1:AE999',
'majorDimension': 'ROWS',
'values':
[
['Name', 'Age', 'Tlf.', 'Address'],
['Julie', '35']
]
}
Note that the length of the two arrays are not unequal, and that instead of None or null values being returned, the data is simply not present in the response.
When working with this data in my code, I end up with an error that looks like this
ValueError: 4 columns passed, passed data had 2 columns
So as far as I can tell I have two options:
Come up with a clever way to pad my response where necessary with None
If possible, instruct the API to return a null value in the JSON where null values exist, especially when the last column(s) have no data at all.
With regards to point 1. I think I can append x None values to the list where x is equal to length_of_column_heading_array - length_of_data_array. This does however seem ugly and perhaps there is a more elegant way of doing it.
And with regards to point 2, I haven't managed to find an answer that helps me.
If anyone has any ideas on how I can solve this, I'd be very grateful.
Cheers!
If anyone is interested, here is how I solved the issue.
First, we need to get all the data from the Sheets API.
# define the names of the tabs I want to get
ranges = ['tab1', 'tab2']
# Call the Sheets API
request = service.spreadsheets().values().batchGet(spreadsheetId=document, ranges=ranges,)
response = request.execute()
Now I want to go through every column and ensure that each row's list contains the same number of elements as the first row which contains the column headings.
# response is the response from google sheets API,
# and from the code above. It contains column headings
# and data from every row.
# valueRanges is the key to access the data.
def extract_case_data(response, keyword):
for obj in response["valueRanges"]:
if keyword in obj["range"]:
values = pad_data(obj["values"])
df = pd.DataFrame(values[1:], columns=values[0])
return df
return None
And finally, the method to pad the data
def pad_data(data: list):
# build a new array with the column heading data
# this is the list which we will return
return_data = [data[0]]
for row in data[1:]:
difference = len(data[0]) - len(row)
new_row = row
# append None to the lists which have a shorter
# length than the column heading list
for count in range(1, difference + 1):
new_row.append(None)
return_data.append(new_row)
return return_data
I'm certainly not saying that this is the best or most elegant solution, but it has done the trick for me.
Hope this helps someone.
Same idea, maybe simpler look:
Get raw values
result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=data_range).execute()
raw_values = result.get('values', [])
Then complete while iterating
for row in raw_values:
row = row + [''] * (expected_length - len(row))

Replace values from pandas dataset with dictionary

I am extracting a column from excel document with pandas. After that, I want to replace for each row of the selected column, all keys contained in multiple dictionaries grouped in a list.
import pandas as pd
file_loc = "excelFile.xlsx"
df = pd.read_excel(file_loc, usecols = "C")
In this case, my dataframe is called by df['Q10'], this data frame has more than 10k rows.
Traditionally, if I want to replace a value in df I use;
df['Q10'].str.replace('val1', 'val1')
Now, I have a dictionary of words like:
mydic = [
{
'key': 'wasn't',
'value': 'was not'
}
{
'key': 'I'm',
'value': 'I am'
}
... + tons of line of key value pairs
]
Currently, I have created a function that iterates over "mydic" and replacer one by one all occurrences.
def replaceContractions(df, mydic):
for cont in contractions:
df.str.replace(cont['key'], cont['value'])
Next I call this function passing mydic and my dataframe:
replaceContractions(df['Q10'], contractions)
First problem: this is very expensive because mydic has a lot of item and data set is iterate for each item on it.
Second: It seems that doesn't works :(
Any Ideas?
Convert your "dictionary" to a more friendly format:
m = {d['key'] : d['value'] for d in mydic}
m
{"I'm": 'I am', "wasn't": 'was not'}
Next, call replace with the regex switch and pass m to it.
df['Q10'] = df['Q10'].replace(m, regex=True)
replace accepts a dictionary of key-replacement pairs, and it should be much faster than iterating over each key-replacement at a time.

Categories

Resources