I'm trying to extract the elements of a list so I can turn it into a CSV file. I have a long list whose elements are lists of tab-separated strings. Example:
print(list_x[0])
['NAME\tGill Pizza Ltd v Labour Inspector', 'JUDGE(S)\tWilliam Young, Ellen France & Williams JJ', 'COURT\tSupreme Court', 'FILE NUMBER\tSC 67-2021', 'JUDG DATE\t12 August 2021', 'CITATION\t[2021] NZSC 97', 'FULL TEXT\t PDF judgment', 'PARTIES\tGill Pizza Ltd (First Applicant), Sandeep Singh (Second Applicant), Jatinder Singh (Third Applicant/Fourth Applicant), Mandeep Singh (Fourth Applicant/Third Applicant), A Labour Inspector (Ministry of Business, Innovation and Employment) (Respondent), Malotia Ltd (First Applicant)', 'STATUTES\tEmployment Relations Act 2000 s6(5), s228(1)', 'CASES CITED\tA Labour Inspector (Ministry of Business, Innovation and Employment) v Gill Pizza Ltd [2021] NZCA 192', 'REPRESENTATION\tGG Ballara, SP Radcliffe, JC Catran, HTN Fong', 'PAGES\t2 p', 'LOCATION\tNew Zealand Law Society Library', 'DATE ADDED\tAugust 19, 2021']
Is it possible to do something like:
name_list = []
file_number_list = []
subject_list = []
held_list = []
pages_list = []
date_list = []
for i in range(len(list_x)):
    if list_x[i].startswith('NAME'):
        name_list.append(list_x[i])
    elif list_x[i].startswith('FILE NUMBER'):
        file_number_list.append(list_x[i])
    elif list_x[i].startswith('SUBJECT'):
        subject_list.append(list_x[i])
    elif list_x[i].startswith('HELD'):
        held_list.append(list_x[i])
    elif list_x[i].startswith('PAGES'):
        pages_list.append(list_x[i])
    elif list_x[i].startswith('DATE ADDED'):
        date_list.append(list_x[i])
Any help is appreciated. Cheers.
You can also try something like this:
my_dict_list = []
for item in list_x:
    # Split each 'KEY\tvalue' string once on the first tab, then zip keys with values.
    my_dict_list.append(dict(zip([i.split('\t', 1)[0] for i in item],
                                 [i.split('\t', 1)[1] for i in item])))
Results:
[{'NAME': 'Gill Pizza Ltd v Labour Inspector',
'JUDGE(S)': 'William Young, Ellen France & Williams JJ',
'COURT': 'Supreme Court',
'FILE NUMBER': 'SC 67-2021',
'JUDG DATE': '12 August 2021',
'CITATION': '[2021] NZSC 97',
'FULL TEXT': ' PDF judgment',
'PARTIES': 'Gill Pizza Ltd (First Applicant), Sandeep Singh (Second Applicant), Jatinder Singh (Third Applicant/Fourth Applicant), Mandeep Singh (Fourth Applicant/Third Applicant), A Labour Inspector (Ministry of Business, Innovation and Employment) (Respondent), Malotia Ltd (First Applicant)',
'STATUTES': 'Employment Relations Act 2000 s6(5), s228(1)',
'CASES CITED': 'A Labour Inspector (Ministry of Business, Innovation and Employment) v Gill Pizza Ltd [2021] NZCA 192',
'REPRESENTATION': 'GG Ballara, SP Radcliffe, JC Catran, HTN Fong',
'PAGES': '2 p',
'LOCATION': 'New Zealand Law Society Library',
'DATE ADDED': 'August 19, 2021'},
{....
....
...}]
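Once you have the list of dicts, writing the CSV itself is straightforward with csv.DictWriter. A minimal sketch, using two made-up rows in place of the real data (field names and values here are illustrative):

```python
import csv
import io

# Stand-in for my_dict_list built above; these two rows are made up.
my_dict_list = [
    {"NAME": "Gill Pizza Ltd v Labour Inspector", "PAGES": "2 p"},
    {"NAME": "Another Case", "PAGES": "5 p"},
]

# io.StringIO stands in for open("cases.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(my_dict_list[0].keys()))
writer.writeheader()
writer.writerows(my_dict_list)
print(buffer.getvalue())
```

With the real data you would pass a fieldnames list covering every key that appears (NAME, JUDGE(S), COURT, and so on) and open an actual file instead of the StringIO buffer.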
I have a DataFrame of books from which I removed and reworked some information. However, some rows in the column "bookISBN" have duplicate values, and I want to merge all of those rows into one.
I plan to make a new DataFrame where I keep the first values for the url, the ISBN, the title and the genre, but I want to sum the values of the column "genreVotes" when merging. How can I do this?
Original dataframe:
In [23]: network = data[["bookTitle", "bookISBN", "highestVotedGenre", "genreVotes"]]
network.head().to_dict("list")
Out [23]:
{'bookTitle': ['The Hunger Games',
'Twilight',
'The Book Thief',
'Animal Farm',
'The Chronicles of Narnia'],
'bookISBN': ['9780439023481',
'9780316015844',
'9780375831003',
'9780452284241',
'9780066238500'],
'highestVotedGenre': ['Young Adult',
'Young Adult',
'Historical-Historical Fiction',
'Classics',
'Fantasy'],
'genreVotes': [103407, 80856, 59070, 73590, 26376]}
Duplicates:
In [24]: duplicates = network[network.duplicated(subset=["bookISBN"], keep=False)]
duplicates.loc[(duplicates["bookISBN"] == "9780439023481") | (duplicates["bookISBN"] == "9780375831003")]
Out [24]:
{'bookTitle': ['The Hunger Games',
'The Book Thief',
'The Hunger Games',
'The Book Thief',
'The Book Thief'],
'bookISBN': ['9780439023481',
'9780375831003',
'9780439023481',
'9780375831003',
'9780375831003'],
'highestVotedGenre': ['Young Adult',
'Historical-Historical Fiction',
'Young Adult',
'Historical-Historical Fiction',
'Historical-Historical Fiction'],
'genreVotes': [103407, 59070, 103407, 59070, 59070]}
(In this example the votes happened to be the same, but in some cases the values are different.)
Expected output:
{'bookTitle': ['The Hunger Games',
'Twilight',
'The Book Thief',
'Animal Farm',
'The Chronicles of Narnia'],
'bookISBN': ['9780439023481',
'9780316015844',
'9780375831003',
'9780452284241',
'9780066238500'],
'highestVotedGenre': ['Young Adult',
'Young Adult',
'Historical-Historical Fiction',
'Classics',
'Fantasy'],
'genreVotes': [260814, 80856, 177210, 73590, 26376]}
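One way to produce that expected output (a sketch; the column names are from the question, but the toy rows below are invented): group on bookISBN, keep the first value of the text columns, and sum genreVotes.

```python
import pandas as pd

# Toy frame with one duplicated ISBN; in the real data the duplicated
# rows may carry different vote counts, which the sum handles the same way.
network = pd.DataFrame({
    "bookTitle": ["The Hunger Games", "The Hunger Games", "Twilight"],
    "bookISBN": ["9780439023481", "9780439023481", "9780316015844"],
    "highestVotedGenre": ["Young Adult", "Young Adult", "Young Adult"],
    "genreVotes": [103407, 103407, 80856],
})

# Keep the first title/genre per ISBN, add up the votes.
merged = network.groupby("bookISBN", as_index=False, sort=False).agg(
    {"bookTitle": "first", "highestVotedGenre": "first", "genreVotes": "sum"}
)
print(merged)
```

sort=False keeps the groups in first-appearance order, matching the expected output above.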
I have a DataFrame that holds different data types (lists, dictionaries, lists of dictionaries, strings, etc.).
df = pd.DataFrame([{'category': [{'id': 1, 'name': 'House Targaryen'}],
'connection': ['Rhaena Targaryen', 'Aegon Targaryen'],
'description': 'Jon Snow, born Aegon Targaryen, is the son of Lyanna Stark '
'and Rhaegar Targaryen, the late Prince of Dragonstone',
'name': 'Jon Snow'},
{'category': [{'id': 2, 'name': 'House Stark'},
{'id': 3, 'name': 'Nights Watch'}],
'connection': ['Robb Stark', 'Sansa Stark', 'Arya Stark', 'Bran Stark'],
'description': 'After successfully capturing a wight and presenting it to '
'the Lannisters as proof that the Army of the Dead are real, '
'Jon pledges himself and his army to Daenerys Targaryen.',
'name': 'Jon Snow'}])
I want to merge these two rows by Jon Snow and combine all other fields together so it looks like
name category description connection
Jon Snow ['House Targaryen','House Stark','Nights Watch'] Jon Snow, born ...... his army to Daenerys Targaryen. ['Rhaena Targaryen',...,'Bran Stark']
It might be a little tricky with the list of dictionaries. Since this is a toy example with only two rows, it's easy to explode the category column and combine the two rows, but I don't think that's practical on my actual data set.
I also thought about using df.groupby('name').aggregate({'category': func1, 'description': func2, 'connection': func3}), but I'm not sure there's a built-in function for what I need.
Thank you all for helping!
Looking at your data, it might be possible to first do a simple groupby and sum, then deal with the categories using a list comprehension:
import pandas as pd

df = pd.DataFrame([{'category': [{'id': 1, 'name': 'House Targaryen'}],
                    'name': 'Jon Snow',
                    'description': 'Jon Snow, born Aegon Targaryen, is the son of Lyanna Stark and Rhaegar Targaryen, the late Prince of Dragonstone',
                    'connection': ['Rhaena Targaryen', 'Aegon Targaryen']},
                   {'category': [{'id': 2, 'name': 'House Stark'}, {'id': 3, 'name': 'Nights Watch'}],
                    'name': 'Jon Snow',
                    'description': 'After successfully capturing a wight and presenting it to the Lannisters as proof that the Army of the Dead are real, '
                                   'Jon pledges himself and his army to Daenerys Targaryen.',
                    'connection': ['Robb Stark', 'Sansa Stark', 'Arya Stark', 'Bran Stark']},
                   {'category': [{'id': 4, 'name': 'Some house'}],
                    'name': 'Some name',
                    'description': 'some desc',
                    'connection': ['connection 1']}])

# Strings and lists both concatenate under "+", so a plain sum merges them per group.
result = df.groupby("name").sum()
# Keep only the category names from each list of {'id': ..., 'name': ...} dicts.
result["category"] = [[item.get("name") for item in i] for i in result["category"]]
result.reset_index(inplace=True)
print(result)
#
name category description connection
0 Jon Snow [House Targaryen, House Stark, Nights Watch] Jon Snow, born Aegon Targaryen, is the son of ... [Rhaena Targaryen, Aegon Targaryen, Robb Stark...
1 Some name [Some house] some desc [connection 1]
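One caveat: the plain .sum() relies on "+" concatenating strings and lists, which newer pandas releases may refuse on object columns. An explicit agg spells the intent out. A sketch on cut-down toy rows (the descriptions are shortened placeholders, not the originals):

```python
import pandas as pd

df = pd.DataFrame([
    {"name": "Jon Snow",
     "category": [{"id": 1, "name": "House Targaryen"}],
     "description": "born Aegon Targaryen",
     "connection": ["Rhaena Targaryen"]},
    {"name": "Jon Snow",
     "category": [{"id": 2, "name": "House Stark"}],
     "description": "pledges himself to Daenerys",
     "connection": ["Robb Stark", "Sansa Stark"]},
])

result = df.groupby("name", as_index=False).agg({
    # Flatten each group's lists of {'id': ..., 'name': ...} dicts to names.
    "category": lambda s: [d["name"] for cats in s for d in cats],
    # Join the description fragments with a space.
    "description": " ".join,
    # Flatten the connection lists into one.
    "connection": lambda s: [c for conns in s for c in conns],
})
print(result)
```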
I have written a script which opens multiple tabs one by one and takes data from each. I am able to get the data from the page, but when writing to the CSV file I get output like this:
Bedrooms Bathrooms Super area Floor Status
3 See Dimensions 3 See Dimensions 2100 7 (Out of 23 Floors) 3 See Dimensions
Bedrooms Bathrooms Super area Floor Status
3 See Dimensions 3 See Dimensions 2100 7 (Out of 23 Floors) 3 See Dimensions
Bedrooms Bathrooms Super area Floor Status
1 1 520 4 (Out of 40 Floors) 1
Bedrooms Bathrooms Super area Floor Status
3 See Dimensions 3 See Dimensions 2100 7 (Out of 23 Floors) 3 See Dimensions
Bedrooms Bathrooms Super area Floor Status
1 1 520 4 (Out of 40 Floors) 1
In the Status column I am getting the wrong value.
I have tried:
# Go through them and click on each.
for unique_link in my_needed_links:
    unique_link.click()
    time.sleep(2)
    driver.switch_to_window(driver.window_handles[1])

    def get_elements_by_xpath(driver, xpath):
        return [entry.text for entry in driver.find_elements_by_xpath(xpath)]

    search_entries = [
        ("Bedrooms", "//div[@class='seeBedRoomDimen']"),
        ("Bathrooms", "//div[@class='p_value']"),
        ("Super area", "//span[@id='coveredAreaDisplay']"),
        ("Floor", "//div[@class='p_value truncated']"),
        ("Lift", "//div[@class='p_value']")]

    with open('textfile.csv', 'a+', newline='') as f_output:
        csv_output = csv.writer(f_output)
        # Write header
        csv_output.writerow([name for name, xpath in search_entries])
        entries = []
        for name, xpath in search_entries:
            entries.append(get_elements_by_xpath(driver, xpath))
        csv_output.writerows(zip(*entries))
    get_elements_by_xpath(driver, xpath)
Edit
Entries, as a list:
[['3 See Dimensions'], ['3 See Dimensions', '4', '3', '1', '2100 sqft', '1400 sqft', '33%', 'Avenue 54', 'Under Construction', "Dec, '20", 'New Property', '₹ 7.90 Cr ₹ 39,50,000 Approx. Registration Charges ₹ 15 Per sq. Unit Monthly\nSee Other Charges', "Santacruz West, Mumbai., Santacruz West, Mumbai - Western Suburbs, Maharashtra What's Nearby", "Next To St Teresa's Convent School & Sacred Heart School on SV Road.", 'East', 'P51800007149 (The project has been registered via MahaRERA registration number: P51800007149 and is available on the website https://maharera.mahaonline.gov.in under registered projects.)', 'Garden/Park, Pool, Main Road', 'Marble, Marbonite, Wooden', '1 Covered', '24 Hours Available', 'No/Rare Powercut', '6', '6', 'Unfurnished', 'Municipal Corporation of Greater Mumbai', 'Freehold', 'Brokers please do not contact', ''], ['2100'], ['7 (Out of 23 Floors)'], ['3 See Dimensions', '4', '3', '1', '2100 sqft', '1400 sqft', '33%', 'Avenue 54 1 Discussion on forum', 'Under Construction', "Dec, '20", 'New Property', '₹ 7.90 Cr ₹ 39,50,000 Approx. Registration Charges ₹ 15 Per sq. Unit Monthly\nSee Other Charges', "Santacruz West, Mumbai., Santacruz West, Mumbai - Western Suburbs, Maharashtra What's Nearby", "Next To St Teresa's Convent School & Sacred Heart School on SV Road.", 'East', 'P51800007149 (The project has been registered via MahaRERA registration number: P51800007149 and is available on the website https://maharera.mahaonline.gov.in under registered projects.)', 'Garden/Park, Pool, Main Road', 'Marble, Marbonite, Wooden', '1 Covered', '24 Hours Available', 'No/Rare Powercut', '6', '6', 'Unfurnished', 'Municipal Corporation of Greater Mumbai', 'Freehold', 'Brokers please do not contact', '']]
[['3 See Dimensions'], ['3 See Dimensions', '4', '3', '1', '2100 sqft', '1400 sqft', '33%', 'Avenue 54 1 Discussion on forum', 'Under Construction', "Dec, '20", 'New Property', '₹ 7.90 Cr ₹ 39,50,000 Approx. Registration Charges ₹ 15 Per sq. Unit Monthly\nSee Other Charges', "Santacruz West, Mumbai., Santacruz West, Mumbai - Western Suburbs, Maharashtra What's Nearby", "Next To St Teresa's Convent School & Sacred Heart School on SV Road.", 'East', 'P51800007149 (The project has been registered via MahaRERA registration number: P51800007149 and is available on the website https://maharera.mahaonline.gov.in under registered projects.)', 'Garden/Park, Pool, Main Road', 'Marble, Marbonite, Wooden', '1 Covered', '24 Hours Available', 'No/Rare Powercut', '6', '6', 'Unfurnished', 'Municipal Corporation of Greater Mumbai', 'Freehold', 'Brokers please do not contact', ''], ['2100'], ['7 (Out of 23 Floors)'], ['3 See Dimensions', '4', '3', '1', '2100 sqft', '1400 sqft', '33%', 'Avenue 54 1 Discussion on forum', 'Under Construction', "Dec, '20", 'New Property', '₹ 7.90 Cr ₹ 39,50,000 Approx. Registration Charges ₹ 15 Per sq. Unit Monthly\nSee Other Charges', "Santacruz West, Mumbai., Santacruz West, Mumbai - Western Suburbs, Maharashtra What's Nearby", "Next To St Teresa's Convent School & Sacred Heart School on SV Road.", 'East', 'P51800007149 (The project has been registered via MahaRERA registration number: P51800007149 and is available on the website https://maharera.mahaonline.gov.in under registered projects.)', 'Garden/Park, Pool, Main Road', 'Marble, Marbonite, Wooden', '1 Covered', '24 Hours Available', 'No/Rare Powercut', '6', '6', 'Unfurnished', 'Municipal Corporation of Greater Mumbai', 'Freehold', 'Brokers please do not contact', '']]
website link: https://www.magicbricks.com/propertyDetails/1-BHK-520-Sq-ft-Multistorey-Apartment-FOR-Sale-Kandivali-West-in-Mumbai&id=4d423333373433343431
Edit 1
my_needed_links = []
list_links = driver.find_elements_by_tag_name("a")
for i in range(0, 2):
    # Get unique links.
    for link in list_links:
        if "https://www.magicbricks.com/propertyDetails/" in link.get_attribute("href"):
            if link not in my_needed_links:
                my_needed_links.append(link)
    # Go through them and click on each.
    for unique_link in my_needed_links:
        unique_link.click()
        time.sleep(2)
        driver.switch_to_window(driver.window_handles[1])

        def get_elements_by_xpath(driver, xpath):
            return [entry.text for entry in driver.find_elements_by_xpath(xpath)]

        search_entries = [
            ("Bedrooms", "//div[@class='seeBedRoomDimen']"),
            ("Bathrooms", "//div[@class='p_value']"),
            ("Super area", "//span[@id='coveredAreaDisplay']"),
            ("Floor", "//div[@class='p_value truncated']"),
            ("Lift", "//div[@class='p_value']")]

        # with open('textfile.csv', 'a+') as f_output:
        entries = []
        for name, xpath in search_entries:
            entries.append(get_elements_by_xpath(driver, xpath))
        data = [entry for entry in entries if len(entry) == 28]
        df = pd.DataFrame(data)
        print(df)
        df.to_csv('nameoffile.csv', mode='a', index=False, encoding='utf-8')

        get_elements_by_xpath(driver, xpath)
        time.sleep(2)
        driver.close()
        # Switch back to the main tab/window.
        driver.switch_to_window(driver.window_handles[0])
Thank you in advance. Please suggest something
The XPath for Bathrooms and the XPath for Lift are the same, so you get the same results in both columns. Try to find another way to identify and distinguish between them. You can probably use an index, though if there's another way it's usually preferred.
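To illustrate the index idea without the live page, here is a stdlib-only sketch (the two-div structure below is assumed, not taken from the real site): when one selector matches several nodes, position decides which is which.

```python
import xml.etree.ElementTree as ET

# Assumed miniature of the page: two divs share class 'p_value'.
html = """
<root>
  <div class="p_value">3</div>
  <div class="p_value">Yes</div>
</root>
"""
root = ET.fromstring(html)

# One selector, several matches; pick by position instead of re-querying.
values = root.findall(".//div[@class='p_value']")
bathrooms = values[0].text  # first match -> Bathrooms
lift = values[1].text       # second match -> Lift
print(bathrooms, lift)
```

In Selenium the equivalent is indexing the list returned by find_elements_by_xpath, or an indexed XPath such as (//div[@class='p_value'])[2] (XPath positions are 1-based).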
I could not load the page due to my location. But from your entries, you could do:
# Your selenium imports
import pandas as pd

def get_elements_by_xpath(driver, xpath):
    return [entry.text for entry in driver.find_elements_by_xpath(xpath)]

for unique_link in my_needed_links:
    unique_link.click()
    time.sleep(2)
    driver.switch_to_window(driver.window_handles[1])
    search_entries = [
        ("Bedrooms", "//div[@class='seeBedRoomDimen']"),
        ("Bathrooms", "//div[@class='p_value']"),
        ("Super area", "//span[@id='coveredAreaDisplay']"),
        ("Floor", "//div[@class='p_value truncated']"),
        ("Lift", "//div[@class='p_value']")]
    entries = []
    for name, xpath in search_entries:
        entries.append(get_elements_by_xpath(driver, xpath))
    # Keep only the entries that look like full rows, then deduplicate.
    data = [entry for entry in entries if len(entry) > 5]
    df = pd.DataFrame(data)
    df.drop_duplicates(inplace=True)
    df.to_csv('nameoffile.csv', sep=';', index=False, encoding='utf-8', mode='a')
    get_elements_by_xpath(driver, xpath)
I have a collection of articles in MongoDB that has the following structure:
{
'category': 'Legislature',
'updated': datetime.datetime(2010, 3, 19, 15, 32, 22, 107000),
'byline': None,
'tags': {
'party': ['Peter Hoekstra', 'Virg Bernero', 'Alma Smith', 'Mike Bouchard', 'Tom George', 'Rick Snyder'],
'geography': ['Michigan', 'United States', 'North America']
},
'headline': '2 Mich. gubernatorial candidates speak to students',
'text': [
'BEVERLY HILLS, Mich. (AP) \u2014 Two Democratic and Republican gubernatorial candidates found common ground while speaking to private school students in suburban Detroit',
"Democratic House Speaker state Rep. Andy Dillon and Republican U.S. Rep. Pete Hoekstra said Friday a more business-friendly government can help reduce Michigan's nation-leading unemployment rate.",
"The candidates were invited to Detroit Country Day Upper School in Beverly Hills to offer ideas for Michigan's future.",
'Besides Dillon, the Democratic field includes Lansing Mayor Virg Bernero and state Rep. Alma Wheeler Smith. Other Republicans running are Oakland County Sheriff Mike Bouchard, Attorney General Mike Cox, state Sen. Tom George and Ann Arbor business leader Rick Snyder.',
'Former Republican U.S. Rep. Joe Schwarz is considering running as an independent.'
],
'dateline': 'BEVERLY HILLS, Mich.',
'published': datetime.datetime(2010, 3, 19, 8, 0, 31),
'keywords': "Governor's Race",
'_id': ObjectId('4ba39721e0e16cb25fadbb40'),
'article_id': 'urn:publicid:ap.org:0611e36fb084458aa620c0187999db7e',
'slug': "BC-MI--Governor's Race,2nd Ld-Writethr"
}
If I wanted to write a query that looked for all articles that had at least 1 geography tag, how would I do that? I have tried writing db.articles.find( {'tags': 'geography'} ), but that doesn't appear to work. I've also thought about changing the search parameter to 'tags.geography', but am having a devil of a time figuring out what the search predicate would be.
If the "geography" field doesn't exist when there aren't any tags in it (i.e., it's created when you add a location), you could do:
db.articles.find({"tags.geography": {$exists: true}})
If it does exist and is empty (i.e., "geography": []), you could add a geography_size field or something and do:
db.articles.find({"tags.geography_size": {$gte: 1}})
There's more info on queries at http://www.mongodb.org/display/DOCS/Advanced+Queries
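If maintaining a separate size field isn't appealing, one common alternative (an assumption worth testing against your data, not from the answer above) is to ask whether the array's first element exists, which is true exactly when the array has at least one entry:

```javascript
// Matches only documents whose tags.geography array has >= 1 element;
// an empty array, or a missing field, fails the test on element 0.
db.articles.find({"tags.geography.0": {$exists: true}})
```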