I have a dataframe full of scientific paper information.
My DataFrame:
   database       authors                                                  title
0  sciencedirect  [{'surname': 'Sharafaldin', 'first_name': 'Iman'},       An eval...
                   {'surname': 'Lashkari', 'first_name': 'Arash Habibi'}]
1  sciencedirect  [{'surname': 'Srinivas', 'first_name': 'Jangirala'},     Governmen...
                   {'surname': 'Das', 'first_name': 'Ashok Kumar'}]
2  sciencedirect  [{'surname': 'Bongiovanni', 'first_name': 'Ivano'}]      The last...
3  ieeexplore     [Igor Kotenko, Andrey Chechulin]                         Cyber Attac...
As you can see, the authors column contains a list of dictionaries, but only where the database is sciencedirect. To analyse the data I first need to clean it, so my goal is to turn the names into plain lists like the one in the last row (index 3).
What I want:
# From:
[{'surname': 'Sharafaldin', 'first_name': 'Iman'}, {'surname': 'Lashkari', 'first_name': 'Arash Habibi'}]
# To:
[Iman Sharafaldin, Arash Habibi Lashkari]
My approach is to create two masks: one on the database column that selects only the sciencedirect papers, and one on the authors column that excludes missing values. From these masks I build a new DataFrame and run the code shown below on its "authors" column. It extracts the author names from each cell and stores them in a list, just as I want:
scidir_mask = df["database"] == 'sciencedirect'
authors_col = df["authors"].notna()
only_scidir = df[authors_col & scidir_mask]
for cell in only_scidir["authors"]:
    # build one list of names per cell
    cell_list = []
    for dictionary in cell:
        # get the values from the dict and reverse them: [first_name, surname]
        name_as_list = [*dictionary.values()][::-1]
        # join first name and surname into one string
        author = ' '.join(name_as_list)
        cell_list.append(author)
So at the end of each loop iteration, cell_list contains the author names in the form I want.
But I can't figure out how to store these lists back into the original DataFrame.
So, how do I take the authors cells where the database is sciencedirect, apply my little function, and store its output back into those cells?
The idea is to create a custom function with f-strings and apply it only to the filtered rows:
scidir_mask = df["database"] == 'sciencedirect'
f = lambda x: [f"{y['first_name']} {y['surname']}" for y in x]
df.loc[scidir_mask, 'authors'] = df.loc[scidir_mask, 'authors'].apply(f)
print(df)
   database       authors                                    title
0  sciencedirect  [Iman Sharafaldin, Arash Habibi Lashkari]  An eval
1  sciencedirect  [Jangirala Srinivas, Ashok Kumar Das]      Governmen
2  sciencedirect  [Ivano Bongiovanni]                        The last
3  ieeexplore     [Igor Kotenko, Andrey Chechulin]           Cyber Attac
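If the column may also hold missing values or rows that are already plain lists of names, the lambda could be swapped for a small named function. This is only a sketch: it assumes each sciencedirect cell is a list of dicts with 'first_name' and 'surname' keys, and format_authors is a made-up helper name, not part of pandas.

```python
def format_authors(cell):
    """Turn a list of author dicts into a list of 'First Last' strings.

    Anything that is not a list (e.g. None) is returned unchanged, and
    list items that are already strings are kept as they are.
    """
    if not isinstance(cell, list):
        return cell
    names = []
    for author in cell:
        if isinstance(author, dict):
            names.append(f"{author['first_name']} {author['surname']}")
        else:
            names.append(author)
    return names


cell = [
    {'surname': 'Sharafaldin', 'first_name': 'Iman'},
    {'surname': 'Lashkari', 'first_name': 'Arash Habibi'},
]
print(format_authors(cell))
# ['Iman Sharafaldin', 'Arash Habibi Lashkari']
```

It can be passed to apply in place of f, i.e. df.loc[scidir_mask, 'authors'].apply(format_authors).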
I have the string below and need help writing an if condition in a for loop that checks whether row.startswith('name') and, if so, takes the value and stores it in a variable called name. The same goes for dob.
Once the for loop completes, the output should be a dictionary like the one below, which I can convert to a pandas DataFrame.
'name john\n \n\nDOB\n12/08/1984\n\ncurrent company\ngoogle\n'
This is what I have tried so far, but I do not know how to get the values into a dictionary:
for row in lines.split('\n'):
    if row.startswith('name'):
        name = row.split()[-1]
Final output:
data = {"name":"john", "dob": "12/08/1984"}
Try using a list comprehension and split:
s = '''name
john

dob
12/08/1984

current company
google'''
d = dict([i.splitlines() for i in s.split('\n\n')])
print(d)
Output:
{'name': 'john', 'dob': '12/08/1984', 'current company': 'google'}
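The snippet above relies on every key sitting on its own line with a blank line between pairs. For the string exactly as posted in the question (where "name john" shares a line and "DOB" is upper case), a loop-based sketch could look like the following. parse_fields and its keys parameter are hypothetical names, and it assumes that no value line itself starts with one of the keys.

```python
def parse_fields(text, keys=("name", "dob")):
    # Drop blank and whitespace-only lines so a value always follows its key.
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    data = {}
    for index, line in enumerate(lines):
        for key in keys:
            if line.lower().startswith(key):
                rest = line[len(key):].strip()
                if rest:
                    # Value is on the same line as the key: "name john"
                    data[key] = rest
                elif index + 1 < len(lines):
                    # Value is on the next non-empty line: "DOB" / "12/08/1984"
                    data[key] = lines[index + 1]
    return data


lines = 'name john\n \n\nDOB\n12/08/1984\n\ncurrent company\ngoogle\n'
print(parse_fields(lines))
# {'name': 'john', 'dob': '12/08/1984'}
```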
I have run into a problem with my Python code.
I am creating a movie filter after scraping IMDb for certain movies.
However, the problem is that movies with multiple genres show up more than once in my movie filter.
My code is the following:
def create_movies_drop_down(self):
    movies = []
    if "genres" in self.mappingQuery:
        mapping = [row for row in dataLakeDB["nordisk-film-movie-mapping"].find(dict(genres = self.mappingQuery["genres"]))]
    else:
        mapping = [row for row in dataLakeDB["nordisk-film-movie-mapping"].find()]
    for row in mapping:
        if row["title"] not in movies:
            movies.append(dict(movie = row["title"][0].upper()+row["title"][1:],
                               imdbPageID = row["imdbPageID"]))
    return movies
For example, because the movie "DRUK" has the genres comedy and drama, it shows up twice with the same title and IMDb page ID.
I have tried several variations, but can't work out why this happens.
Can anyone help here?
Edit: The mapping for 1 movie is like this:
[{'_id': '6028139039cba4ae2722f8d9', 'castList': '[Rosa Salazar, Christoph Waltz, Jennifer Connelly, Mahershala Ali, Ed Skrein]', 'clientID': 'FILM', 'dcmCampaignID': [''], 'director': 'Robert Rodriguez', 'dv360InsertionOrderID': ['7675053', '7675055', '7675065', '7683006', '7863461'], 'genres': ['action', 'adventure', 'sci-fi'], 'imdbPageID': '0437086', 'imdbPageURL': 'https://www.imdb.com/title/tt0437086', 'imdbRating': '7.3', 'marathonCountryID': 'PMDK', 'posterURL': 'https://m.media-amazon.com/images/M/MV5BMTQzYWYwYjctY2JhZS00NTYzLTllM2UtZWY5ZTk0NmYwYzIyXkEyXkFqcGdeQXVyMzgxODM4NjM#.V1_UX182_CR0,0,182,268_AL.jpg', 'title': 'alita: battle angel\xa0(2019)'}
Since movies is a list of dictionaries (which are unhashable), converting it to a set to get rid of duplicates will not work. Instead, you have to iterate and append each movie to the movies list only if it is not already there. You already tried to do this with the if statement inside the for loop. The problem is that its condition is always True: row["title"] is a string, while movies holds dictionaries, so the title is never found in the list. Check for the whole dictionary object instead. You can fix it like this:
def create_movies_drop_down(self):
    movies = []
    if "genres" in self.mappingQuery:
        mapping = [row for row in
                   dataLakeDB["nordisk-film-movie-mapping"].find(dict(genres=self.mappingQuery["genres"]))]
    else:
        mapping = [row for row in dataLakeDB["nordisk-film-movie-mapping"].find()]
    for row in mapping:
        movie_dic = dict(movie=row["title"][0].upper() + row["title"][1:],
                         imdbPageID=row["imdbPageID"])
        if movie_dic not in movies:
            movies.append(movie_dic)
    return movies
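An alternative sketch that avoids comparing whole dictionaries is to remember the imdbPageID values already added in a set. It assumes that imdbPageID identifies a movie uniquely; unique_movies and the sample rows are made up for illustration, and the database query is left out.

```python
def unique_movies(mapping):
    """Return one {'movie', 'imdbPageID'} dict per distinct imdbPageID."""
    seen_ids = set()
    movies = []
    for row in mapping:
        if row["imdbPageID"] in seen_ids:
            continue  # this movie was already added
        seen_ids.add(row["imdbPageID"])
        title = row["title"]
        movies.append(dict(movie=title[0].upper() + title[1:],
                           imdbPageID=row["imdbPageID"]))
    return movies


# Hypothetical rows: the same movie returned twice by the query.
rows = [
    {"title": "alita: battle angel", "imdbPageID": "0437086"},
    {"title": "alita: battle angel", "imdbPageID": "0437086"},
]
print(unique_movies(rows))
# [{'movie': 'Alita: battle angel', 'imdbPageID': '0437086'}]
```

Membership tests on a set stay fast as the list grows, whereas `movie_dic not in movies` scans the whole list each time.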
I don't even know how to approach it as it feels too complex for my level.
Imagine courier tracking numbers, where I receive some duplicated updates from an upstream system in the following format (see the attached image, or the small piece of code below that creates such a table):
import pandas as pd
incoming_df = pd.DataFrame({
    'Tracking ID' : ['4845','24345', '8436474', '457453', '24345-S2'],
    'Previous' : ['Paris', 'Lille', 'Paris', 'Marseille', 'Dijon'],
    'Current' : ['Nantes', 'Dijon', 'Dijon', 'Marseille', 'Lyon'],
    'Next' : ['Lyone', 'Lyon', 'Lyon', 'Rennes', 'NICE']
})
incoming_df
Obviously, tracking ID 24345-S2 (green arrow) is a duplication of 24345 (red arrow), however, it is not fully duplicated but a newer, updated location information (with history) for the parcel. How do I delete old line 24345 and keep new line 24345-S2 in the data set?
The length of tracking ID can be from 4 to 20 chars but '-S2' is always helpfully appended.
Thank you!
Edit: New solution:
# extract the base IDs that have a newer '-S2' entry
duplicates = incoming_df['Tracking ID'].str.extract(r'(.+)-S2$').dropna()
# remove the older entries, if any
incoming_df = incoming_df[~incoming_df['Tracking ID'].isin(duplicates[0].unique())]
If the 1234-S2 entry is always further down in the DataFrame than the 1234 entry, you could do something like:
# remove the suffix from all entries
incoming_df['Tracking ID'] = incoming_df['Tracking ID'].apply(lambda x: x.split('-')[0])
# keep only the last entry of the duplicates
incoming_df = incoming_df.drop_duplicates(subset='Tracking ID', keep='last')
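To make the logic of the first snippet explicit, here is the same filtering spelled out on plain Python dicts rather than a DataFrame. This is a sketch only: drop_superseded and SUFFIX are illustrative names, and the rows are a shortened version of the question's data.

```python
SUFFIX = "-S2"


def drop_superseded(rows):
    """Remove rows whose Tracking ID also appears with the '-S2' suffix."""
    # Base IDs for which a newer '-S2' row exists, e.g. {'24345'}
    superseded = {
        row["Tracking ID"][:-len(SUFFIX)]
        for row in rows
        if row["Tracking ID"].endswith(SUFFIX)
    }
    return [row for row in rows if row["Tracking ID"] not in superseded]


rows = [
    {"Tracking ID": "4845", "Current": "Nantes"},
    {"Tracking ID": "24345", "Current": "Dijon"},
    {"Tracking ID": "24345-S2", "Current": "Lyon"},
]
print([row["Tracking ID"] for row in drop_superseded(rows)])
# ['4845', '24345-S2']
```

Unlike the drop_duplicates variant, this keeps the '-S2' suffix on the surviving row and does not depend on row order.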
I'm using the Google Sheets API to get data, which I then pass to pandas so I can easily work with it.
Let's say I want to get a sheet with the following data (depicted as a JSON object, as tables weren't presented well here):
{
    columns: ['Name', 'Age', 'Tlf.', 'Address'],
    data: ['Julie', '35', '12345', '8 Leafy Street']
}
The sheets API will return something along the lines of this:
{
    'range': 'Cases!A1:AE999',
    'majorDimension': 'ROWS',
    'values':
    [
        ['Name', 'Age', 'Tlf.', 'Address'],
        ['Julie', '35', '12345', '8 Leafy Street']
    ]
}
This is great and allows me to easily pass the column headings and data to Pandas without much fuss. I do this in the following manner:
values = sheets_api_result["values"]
df = pd.DataFrame(values[1:], columns=values[0])
My Problem
If I have a G Suite sheet that looks like the table below, depicted as a key:value data type,
{
    columns: ['Name', 'Age', 'Tlf.', 'Address'],
    data: ['Julie', '35', '', '']
}
I will receive the following response:
{
    'range': 'Cases!A1:AE999',
    'majorDimension': 'ROWS',
    'values':
    [
        ['Name', 'Age', 'Tlf.', 'Address'],
        ['Julie', '35']
    ]
}
Note that the lengths of the two arrays are not equal, and that instead of None or null values being returned, the data is simply absent from the response.
When working with this data in my code, I end up with an error that looks like this
ValueError: 4 columns passed, passed data had 2 columns
So as far as I can tell I have two options:
1. Come up with a clever way to pad my response with None where necessary.
2. If possible, instruct the API to return a null value in the JSON where null values exist, especially when the last column(s) have no data at all.
With regard to point 1, I think I can append x None values to the list, where x is equal to length_of_column_heading_array - length_of_data_array. This does, however, seem ugly, and perhaps there is a more elegant way of doing it.
With regard to point 2, I haven't managed to find an answer that helps me.
If anyone has any ideas on how I can solve this, I'd be very grateful.
Cheers!
If anyone is interested, here is how I solved the issue.
First, we need to get all the data from the Sheets API.
# define the names of the tabs I want to get
ranges = ['tab1', 'tab2']
# Call the Sheets API
request = service.spreadsheets().values().batchGet(spreadsheetId=document, ranges=ranges,)
response = request.execute()
Now I want to go through every row and ensure that its list contains the same number of elements as the first row, which contains the column headings.
# response is the response from the Google Sheets API
# (see the code above). It contains the column headings
# and the data from every row.
# valueRanges is the key to access the data.
def extract_case_data(response, keyword):
    for obj in response["valueRanges"]:
        if keyword in obj["range"]:
            values = pad_data(obj["values"])
            df = pd.DataFrame(values[1:], columns=values[0])
            return df
    return None
And finally, the function that pads the data:
def pad_data(data: list):
    # build a new list that starts with the column headings;
    # this is the list which we will return
    return_data = [data[0]]
    for row in data[1:]:
        difference = len(data[0]) - len(row)
        # copy the row so the API response itself is not modified
        new_row = list(row)
        # append None to the rows which are shorter
        # than the column heading list
        for _ in range(difference):
            new_row.append(None)
        return_data.append(new_row)
    return return_data
I'm certainly not saying that this is the best or most elegant solution, but it has done the trick for me.
Hope this helps someone.
Same idea, with a maybe simpler look.
Get the raw values:
result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=data_range).execute()
raw_values = result.get('values', [])
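Then pad each row while iterating:
# assumption: the first row holds the column headings
expected_length = len(raw_values[0]) if raw_values else 0
for row in raw_values:
    row = row + [''] * (expected_length - len(row))
    # work with the padded row here; raw_values itself is left unchanged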
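Both answers boil down to padding each row up to the length of the header row. A compact sketch of that idea, without calling the API, might look like this. pad_rows and its fill parameter are hypothetical names, and it assumes the first row holds the column headings.

```python
def pad_rows(values, fill=None):
    """Return new rows, each padded to the length of the header row."""
    if not values:
        return []
    width = len(values[0])
    # [fill] * n is an empty list when n <= 0, so rows that are already
    # long enough are copied unchanged.
    return [row + [fill] * (width - len(row)) for row in values]


values = [
    ['Name', 'Age', 'Tlf.', 'Address'],
    ['Julie', '35'],
]
print(pad_rows(values))
# [['Name', 'Age', 'Tlf.', 'Address'], ['Julie', '35', None, None]]
```

The padded result can then be passed on as before, e.g. pd.DataFrame(padded[1:], columns=padded[0]) with padded = pad_rows(values). The input lists are not modified.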
I want to create a dictionary that has predefined keys, like this:
dict = {'state':'', 'county': ''}
and read through and get values from a spreadsheet, like this:
for row in range(rowNum):
    for col in range(colNum):
and update the values for the keys 'state' (sheet.cell_value(row, 1)) and 'county' (sheet.cell_value(row, 1)) like this:
dict[{}]
I am confused on how to get the state value with the state key and the county value with the county key. Any suggestions?
Desired outcome would look like this:
>>>print dict
[
{'state':'NC', 'county': 'Nash County'},
{'state':'VA', 'county': 'Albemarle County'},
{'state':'GA', 'county': 'Cook County'},....
]
I made a few assumptions regarding your question. You mentioned in the comments that State is at index 1 and County is at index 3; what is at index 2? I assumed that they occur sequentially. In addition to that, there needs to be a way in which you can map the headings to the data columns, hence I used a list to do that as it maintains order.
# A list containing the headings that you are interested in, in the order in which you expect them in your spreadsheet
list_of_headings = ['state', 'county']
# Simulating your spreadsheet
spreadsheet = [['NC', 'Nash County'], ['VA', 'Albemarle County'], ['GA', 'Cook County']]
list_of_dictionaries = []
for i in range(len(spreadsheet)):
    dictionary = {}
    for j in range(len(spreadsheet[i])):
        dictionary[list_of_headings[j]] = spreadsheet[i][j]
    list_of_dictionaries.append(dictionary)
print(list_of_dictionaries)
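The nested index loops can also be written with zip, which pairs each heading with the value in the same position. A minimal sketch using the same simulated data (it assumes, as above, that the headings are listed in column order):

```python
list_of_headings = ['state', 'county']
spreadsheet = [['NC', 'Nash County'], ['VA', 'Albemarle County'], ['GA', 'Cook County']]

# zip stops at the shorter of the two sequences, so extra columns
# without a heading are silently ignored.
list_of_dictionaries = [dict(zip(list_of_headings, row)) for row in spreadsheet]
print(list_of_dictionaries)
# [{'state': 'NC', 'county': 'Nash County'}, {'state': 'VA', 'county': 'Albemarle County'}, {'state': 'GA', 'county': 'Cook County'}]
```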
Raqib's answer is partially correct, but it had to be modified for use with an actual spreadsheet with rows and columns and the xlrd module. What I did was first use xlrd methods to grab the cell values that I wanted and put them into a list (similar to the spreadsheet variable Raqib has shown above). Note that the parameters sI and cI are the column index values I picked out in a previous step: sI = StateIndex and cI = CountyIndex.
list = []
for row in range(rowNum):
    for col in range(colNum):
        list.append([str(sheet.cell_value(row, sI)), str(sheet.cell_value(row, cI))])
Now that I have a list of the states and counties, I can apply Raqib's solution:
list_of_headings = ['state', 'county']
fipsDic = []
print(len(list))
for i in range(len(list)):
    temp = {}
    for j in range(len(list[i])):
        temp[list_of_headings[j]] = list[i][j]
    fipsDic.append(temp)
The result is a nice list of dictionaries that looks like this:
[{'county': 'Minnehaha County', 'state': 'SD'}, {'county': 'Minnehaha County', 'state': 'SD', ...}]