I'm analyzing club participation. I get the data as JSON through a URL request. This is the JSON I receive and load with json.loads:
df = [{"club_id":"1234", "sum_totalparticipation":227, "level":1, "idsubdatatable":1229, "segment": "club_id==1234;eventName==national%2520participation,eventName==local%2520partipation,eventName==global%2520participation", "subtable":[{"label":"national participation", "sum_events_totalevents":105,"level":2},{"label":"local participation","sum_events_totalevents":100,"level":2},{"label":"global_participation","sum_events_totalevents":22,"level":2}]}]
when I use json_normalize, this is how df looks:
so, specific participations are aggregated and only the sum is available, and I need them flattened, with global/national/local participation in separate rows.
Can you help by providing code?
If you want to see the details of the subtable field (which is itself a list of dictionaries), you can do the following:
...
df = pd.DataFrame(*data)
for i in range(len(df)):
    df.loc[i, 'label'] = df.loc[i, 'subtable']['label']
    df.loc[i, 'sum_events_totalevents'] = df.loc[i, 'subtable']['sum_events_totalevents']
    df.loc[i, 'sublevel'] = int(df.loc[i, 'subtable']['level'])
Note: I purposely renamed the level field inside the subtable to sublevel; there is already a column named level in the dataframe, and this avoids a name conflict.
The data you show us after your json.load looks quite dirty: some quotes look missing, especially after "segment":"club_id==1234", and the ; separator at the beginning does not match the key separator expected inside a dict.
Nonetheless, let's consider the data you get is supposed to look like this (a list of dictionaries):
import pandas as pd
data = [{"club_id":"1234", "sum_totalparticipation":227, "level":1, "idsubdatatable":1229, "segment": "club_id==1234;eventName==national%2520participation,eventName==local%2520partipation,eventName==global%2520participation",
"subtable":[{"label":"national participation", "sum_events_totalevents":105,"level":2},{"label":"local participation","sum_events_totalevents":100,"level":2},{"label":"global_participation","sum_events_totalevents":22,"level":2}]}]
You can see the result with one row per subtable entry by unpacking your data into a DataFrame:
df = pd.DataFrame(*data)
This is the table we get:
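As an aside, pandas.json_normalize can produce the per-participation rows in one call. A minimal sketch with the same structure as the data above (the long "segment" string is omitted here for brevity; record_prefix is added so the inner level field doesn't clash with the outer level column):

import pandas as pd

# Same structure as the `data` list above, minus the "segment" string.
data = [{"club_id": "1234", "sum_totalparticipation": 227, "level": 1, "idsubdatatable": 1229,
         "subtable": [{"label": "national participation", "sum_events_totalevents": 105, "level": 2},
                      {"label": "local participation", "sum_events_totalevents": 100, "level": 2},
                      {"label": "global_participation", "sum_events_totalevents": 22, "level": 2}]}]

flat = pd.json_normalize(
    data,
    record_path="subtable",                               # one row per participation type
    meta=["club_id", "sum_totalparticipation", "level"],  # outer fields repeated on each row
    record_prefix="sub_",                                  # avoids clashing with the outer "level"
)
print(flat)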
Hope this helps
I am trying to add a column to a df (a large Excel file imported as a df with pandas). The new column would hold the output errors from the LanguageTool import when applied to a column in the df. So for each row, I'd have the errors, or blank/no errors, in a new column 'Issues'.
import language_tool_python
import pandas as pd
tool = language_tool_python.LanguageTool('en-US')
fn = "Example.xlsx"
xlreader = pd.read_excel(fn, sheet_name="This is Starting File")
for row in xlreader:
    text = str(xlreader[['Description']])
    xlreader['Issues'] = tool.check(text)
The above results in a ValueError.
I also tried,
xlreader['Issues'] = xlreader.apply(lambda x: tool.check(text))
The result was NaN, even though there are errors.
Is there a way to accomplish the desired output?
Desired output:

ID      Description                       Added column 'Issues'
1-432   "The text withissues to check"    Possible spelling mistake
Maybe do these changes:
To cast as str:
xlreader['Description'] = xlreader['Description'].astype(str)
To apply the function:
xlreader['Issues'] = xlreader['Description'].apply(lambda x: tool.check(x))
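Put together, a minimal sketch assuming the same file and sheet names as in the question (tool.check returns a list of Match objects, which is empty when no issues are found):

import language_tool_python
import pandas as pd

tool = language_tool_python.LanguageTool('en-US')

# Same file and sheet names as in the question.
xlreader = pd.read_excel("Example.xlsx", sheet_name="This is Starting File")

# Make sure the column is string-typed, then check each cell individually.
xlreader['Description'] = xlreader['Description'].astype(str)
xlreader['Issues'] = xlreader['Description'].apply(lambda x: tool.check(x))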
This is not my actual data, just a representation
I'm trying to save a pandas dataframe into a postgres database using "to_sql". When I try to do so, however, I get the error "column "my_column" specified more than once".
Thing is, I've taken some precautions because I know I can have duplicated columns in this project.
I'm using this function to add counters to repeated columns:
def df_column_uniquify(df):
    df_columns = df.columns
    new_columns = []
    for item in df_columns:
        counter = 0
        newitem = item
        while newitem in new_columns:
            counter += 1
            newitem = "{}_{}".format(counter, item)
        new_columns.append(newitem)
    df.columns = new_columns
    return df
So if I have two columns named "my_column_is_ok", using this function gives me:
"1_my_column_is_ok" and "2_my_column_is_ok".
Problem is, when dealing with longer column names, postgres doesn't seem to read the entire name before flagging it as a duplicate.
So if a have this situation:
column A:
"my_column_is_ok_but_it_is_sorta_long"
column B:
"my_column_is_ok_but_it_is_sorta_short"
This error returns something like
"column "my_column_is_ok_but_it_is_sorta" specified more than once"
My counter function doesn't work because these names are not the same, not in their entirety.
If anyone can help me understand how to deal with this, I would be very thankful.
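For what it's worth, PostgreSQL truncates identifiers to 63 bytes, which is why the two long names collide only after the database shortens them. A hedged sketch (hypothetical helper name) that truncates column names to that limit before running the same deduplication as df_column_uniquify above:

# Hypothetical variant: shorten names to PostgreSQL's 63-byte identifier limit
# first, then deduplicate, so names that only differ after the cut-off still
# end up unique once truncated.
def df_column_uniquify_truncated(df, max_len=63):
    new_columns = []
    for item in df.columns:
        base = item[:max_len]
        newitem = base
        counter = 0
        while newitem in new_columns:
            counter += 1
            suffix = "_{}".format(counter)
            newitem = base[:max_len - len(suffix)] + suffix
        new_columns.append(newitem)
    df.columns = new_columns
    return df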
I am quite new to Python programming.
I am working with the following dataframe:
Before
Note that in column "FBgn", there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with FBgn values provided in the adjacent column called "## FlyBase_FBgn". However, I want to keep the FBgn values in column "FBgn". Maybe keep in mind that I am showing only a portion of the dataframe (reality: 1432 rows). How would I do that? I tried the replace() method from Pandas, but it did not work.
This is actually what I would like to have:
After
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
Welcome to Stack Overflow. Please provide more info next time, including your code; it is always helpful.
Please see the code below, I think you need something similar
import pandas as pd
#ignore the dict1, I just wanted to recreate your df
dict1= {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'], "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1) #recreating your dataframe
#print df
print(df)
#function to replace the values
def replace_values(df):
    for i in range(len(df)):
        if 'tr' in df['FBgn'][i]:
            df.loc[i, 'FBgn'] = df.loc[i, 'FBtr']
    return df
df = replace_values(df)
#print new df
print(df)
I'm trying to import an Excel file and search for a specific record.
Here's what I have come up with so far, which keeps throwing an error.
The excel spread sheet has two columns Keyword and Description, each keyword is around 10 characters max, and description is around 150 characters max.
I can print the whole sheet in the excel file without any errors using print(df1) but as soon as I try to search for a specific value it errors out.
Error
ValueError: ('Lengths must match to compare', (33,), (1,))
Code
import pandas as pd
file = 'Methods.xlsx'
df = pd.ExcelFile(file)
df1 = df.parse('Keywords')
lookup = df1['Description'].where(df1['Keyword']==["as"])
print (lookup)
The filter syntax is like this:
df_filtered = df[df[COLUMN]==KEYWORD]
so in your case it'd be
lookup = df1[df1['Keyword'] == "as"]['Description']
or the whole code
import pandas as pd
file = 'Methods.xlsx'
df = pd.ExcelFile(file)
df1 = df.parse('Keywords')
lookup = df1[df1['Keyword'] == "as"]['Description']
print (lookup)
breaking it down:
is_keyword = df1['Keyword'] == "as"
this would return a series containing True or False depending on whether the keyword is present.
then we can filter the dataframe to get only the rows that are True with:
df_filtered = df1[is_keyword]
this will still contain all the columns, so to get only the Description column we select it with
lookup = df_filtered['Description']
or in one line
lookup = df1[df1['Keyword'] == "as"]['Description']
adding to the elaborate answer given by @Jimmar above:
Just for syntactical convenience, you could write the code like this:
lookup = df1[df1.Keyword == "as"].Description
Pandas provides column name lookup as if the column were a member of the DataFrame class (dot notation). Please note that for this to work the column names should not contain any spaces.
I'm wanting to aggregate some API responses into a DataFrame.
The request consistently returns a number of JSON key/value pairs, let's say A, B, C. Occasionally, however, it will return A, B, C, D.
I would like something comparable to SQL's OUTER JOIN, which would simply add the new rows while filling any missing columns with NULL or some other placeholder.
The pandas join options insist on imposing a unique suffix for each side, which I really don't want.
Am I looking at this the wrong way?
If there is no easy solution, I could just select a subset of the consistently available columns, but I really wanted to download the lot and do the processing as a separate stage.
You can use pandas.concat as it provides all the functionality required for your problem. Let this toy problem illustrate a possible solution.
import numpy as np
import pandas as pd

# This generates random data with some key and value pairs.
def gen_data(_size):
    import string
    keys = list(string.ascii_uppercase)
    return dict((k, [v]) for k, v in zip(np.random.choice(keys, _size), np.random.randint(1000, size=_size)))

counter = 0
df = pd.DataFrame()
while True:
    if counter > 5:
        break
    # Receive the data
    new_data = gen_data(5)
    # Converting this to a dataframe obj
    new_data = pd.DataFrame(new_data)
    # Appending this data to my stack
    df = pd.concat((df, new_data), axis=0, sort=True)
    counter += 1

df.reset_index(drop=True, inplace=True)
print(df.to_string())
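A closely related shortcut, assuming each response arrives as a plain dict of key/value pairs: the DataFrame constructor (like concat) fills missing keys with NaN on its own, so the occasional extra D column simply shows up with NaN in the earlier rows.

import pandas as pd

# Hypothetical responses: the second one carries an extra key "D".
responses = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 4, "B": 5, "C": 6, "D": 7},
]
df = pd.DataFrame(responses)   # "D" is NaN for the first row
print(df)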