I have to open an XML file from this website and turn it into a DataFrame.
I tried to parse the XML into a dict and then pass that dict to a DataFrame:
from urllib.request import urlopen
import xmltodict
import pandas as pd
file = urlopen('https://analisi.transparenciacatalunya.cat/download/8s6p-h233/text%2Fxml')
data_bytes = file.read()
orderDictListData = xmltodict.parse(data_bytes)
orderDictListData
df = pd.DataFrame(orderDictListData)
I need a DataFrame whose columns run from the key "id" through "codiINEmunicipi", like this:
How about simply using pandas.read_xml (added in pandas 1.3):
url = 'https://analisi.transparenciacatalunya.cat/download/8s6p-h233/text%2Fxml'
df = pd.read_xml(url)
output:
id nom carrec tractament resp iddep dep idpare codidep nif ordre datamodificacio datacreacio centres sinonims
0 535 012 Atenció Ciutadana None None None 3392 Departament de la Vicepresidència i de Polítiques Digitals i Territori 6564 PTO None 912000 02/06/2021 19/06/1997 NaN NaN
1 3383 061 Salut Respon None None None 2803 Departament de Salut 7021 SLT None 1000 23/02/2021 19/06/1997 NaN NaN
2 5500 ACCIÓ - Agència per a la Competitivitat de l'Empresa consellera delegada de l'Agència per a la Competitivitat de l'Empresa, ACCIÓ Sra. Natàlia Mas Guix 19775 Departament d'Empresa i Treball 19035 EMO S-0800476-D 323699 28/02/2022 19/06/1997 NaN NaN
3 5504 ACCIÓ a Girona delegat d'ACCIÓ a Girona Sr. Ferran Rodero 19775 Departament d'Empresa i Treball 5500 EMO None 10500 25/01/2016 19/06/1997 NaN NaN
4 5505 ACCIÓ a Lleida delegada d'ACCIÓ a Lleida Sra. Clara Porta Sànchez 19775 Departament d'Empresa i Treball 5500 EMO None 11500 25/01/2016 19/06/1997 NaN NaN
...
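For readers who cannot reach the URL, here is a minimal sketch of what read_xml does, run on a small inline document. The sample XML (a `response` root with repeated `row` elements) is made up for illustration and is only an assumption about how such a file might look, not the actual structure of the download. `xpath` selects which elements become rows, and `parser="etree"` is used only so the sketch does not depend on lxml being installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample document: the element names and values are illustrative only.
SAMPLE_XML = """<response>
  <row><id>535</id><nom>012 Atencio Ciutadana</nom><codidep>PTO</codidep></row>
  <row><id>3383</id><nom>061 Salut Respon</nom><codidep>SLT</codidep></row>
</response>
"""

# Each <row> element becomes one DataFrame row; its child tags become columns.
df_sample = pd.read_xml(StringIO(SAMPLE_XML), xpath="./row", parser="etree")
print(df_sample)
```

If the real file nests its rows more deeply, adjusting `xpath` is probably the first thing to try.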
I am a beginner at Python, and I have some issues with code I wrote.
I have two dataframes: one with general information about books (dfMediaGe), and the other with the titles of books that were shown on TV (dfTV).
My goal is to compare them and fill the column 'At least 1 TV emission' in dfMediaGe with a 1 if the book appears in dfTV.
My difficulty is that the dataframes do not have the same number of rows or columns.
Here is a sample of dfMediaGe:
Titre original_title AUTEUR DATE EDITEUR THEMESIMPLE THEME GENRE rating rating_average ... current_count done_count list_count recommend_count review_count TRADUITDE LANGUEECRITURE NOTE At least 1 TV emission id
0 La souris des dents NaN Roger, Marie-Sabine|Desbons, Marie 01/01/2021 Lito TIPJJ001 Eveil J000100 Jeunesse - Eveil et Fiction / Histoire... GJEU003 Jeunesse / Mini albums|GJEU013 Jeuness... NaN NaN ... 0.0 0.0 0.0 0.0 0.0 NaN fre NaN 0 46220676.0
1 La petite mare du grand crocodile NaN Buteau, Gaëlle|Hudrisier, Cécile 01/01/2021 Lito TIPJJ001 Eveil J000100 Jeunesse - Eveil et Fiction / Histoire... GJEU003 Jeunesse / Mini albums|GJEU013 Jeuness... NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 NaN fre NaN 46220678.0
And here is a sample of dfTV:
Titre AUTEUR DATE EDITEUR GENRE THEMESIMPLE TRADUITDE NOTE THEME LANGUEECRITURE FORMATNUMERIQUE PUBLIC MATIERE LEXIQUE DESCRIPTION
0 Les strates Bagieu, Pénélope 11/12/2021 Gallimard NaN TIPBD001 Albums NaN NaN T090200 Bandes dessinées / Bandes dessinées fre NaN NaN NaN NaN 1 vol. ; illustrations en noir et blanc ; 24 x...
And here is the code I wrote, which is not working at all.
for Titre, r in dfMediaGe.iterrows():
    for Titre, r in dfTV.iterrows():
        p = 0
        if r['Titre'].values == (dfTV['Titre']).values.any():
            p = 1
        r['Au moins 1 passage TV'].append(p)
I get this error:
AttributeError: 'str' object has no attribute 'values'
Thank you very much for your help !!
I don't think it is a problem that your two dataframes have different numbers of columns.
You can achieve what you are looking for using this:
import pandas as pd

# Small made-up sample standing in for dfMediaGe
data_dfMediaGe = [
    ['Les strates Bagieu'],
    ['La petite mare du grand crocodile'],
    ['La souris des dents NaN Roger'],
    ['Movie XYZ']
]
dfMediaGe = pd.DataFrame(data=data_dfMediaGe, columns=['Titre'])
dfMediaGe['Au moins 1 passage TV'] = 0
# Small made-up sample standing in for dfTV
data_dfTV = [
    ['La petite mare du grand crocodile'],
    ['Movie XYZ']
]
dfTV = pd.DataFrame(data=data_dfTV, columns=['Titre'])
# Flag every row of dfMediaGe whose title also appears in dfTV
for i, row in dfMediaGe.iterrows():
    if row['Titre'] in list(dfTV['Titre']):
        dfMediaGe.at[i, 'Au moins 1 passage TV'] = 1
print(dfMediaGe)
Titre Au moins 1 passage TV
0 Les strates Bagieu 0
1 La petite mare du grand crocodile 1
2 La souris des dents NaN Roger 0
3 Movie XYZ 1
All you have to do is iterate through the rows of dfMediaGe and check whether the value in the Titre column is present in the Titre column of dfTV.
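As a side note, the AttributeError in the question comes from `r['Titre']`: inside `iterrows()` that expression is already a plain string, so it has no `.values` attribute. If an explicit loop is not required, the same membership check can probably be written without one using `Series.isin`. A minimal sketch, reusing the made-up sample titles from this answer:

```python
import pandas as pd

# Made-up sample data, same titles as in the loop-based example.
dfMediaGe = pd.DataFrame({
    "Titre": [
        "Les strates Bagieu",
        "La petite mare du grand crocodile",
        "La souris des dents NaN Roger",
        "Movie XYZ",
    ]
})
dfTV = pd.DataFrame({"Titre": ["La petite mare du grand crocodile", "Movie XYZ"]})

# isin returns one boolean per row of dfMediaGe; astype(int) turns it into 0/1.
dfMediaGe["Au moins 1 passage TV"] = dfMediaGe["Titre"].isin(dfTV["Titre"]).astype(int)
print(dfMediaGe)
```

This compares titles exactly, so differences in case or surrounding whitespace between the two dataframes would still need to be normalized first.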
I have a dataframe with 49 columns. Most of them are categorical (dtype object) and some are numerical. As a newcomer to data science, I tried to plot a Pearson correlation heatmap to see how the independent variables are correlated, but only the numeric variables are taken into account.
So how can I measure the relationship between the categorical and numerical variables of a dataframe?
Here is an excerpt of my dataframe:
>>> df1.head(3)
Sexe date_naissance Groupe_dage ville Statut_marital Niveau_de_scolarite Situation_professionnelle Autre_situation_professionnelle Revenu_mensuel Si_connexion_internet Canal_acces_info Autre_canal_acces_info Si_situtation_ville_degradee Si_intention_emigration Besoin_Sante Besoin_Education Besoin_Conditions_de_vie Besoin_Lutte_contre_criminalite Besoin_Emploi Besoin_Lutte_contre_corruption Besoin_Eau_potable Besoin_Infrastructures Besoin_Culture_art Besoin_Amelioration_services_publics Besoin_Acces_logement Besoin_Autres_besoins Non_declaration_besoins Autres_besoins Si_connait_president_commune Si_connait_parlementaires Si_inscrit_LE Si_vote_2016 Intention_vote_2021 Consentement Langue_du_questionnaire region id_reg status nbr_app adherent
0 Une femme 1964-04-15 Entre 45 et 54 ans Al Hoceima Marié et je n'ai pas encore d'enfants à charge 1er cycle universitaire / Licence Je suis independent NaN 5,000-7,499 DHS Oui Internet NaN Je suis d'accord Je ne suis pas d'accord True False True False True False False False False False False False False NaN Oui Oui Oui Oui Je sais déjà pour qui je vais voter en 2021 J'accepter d'être recontacté Arabe Tanger-Tetouan-Al Hoceima 1.0 Qualifié 3.0 True
1 Une femme NaN Entre 18 et 24 ans Tétouan Célibataire 1er cycle universitaire / Licence Je suis journalier, je travaille de temps à a... NaN 1-2,499 DHS Non Internet NaN Je suis d'accord Je suis d'accord True True False False True False False False False False False False False NaN Oui Non Non NaN Je ne voterai pas en 2021 Non Arabe Tanger-Tetouan-Al Hoceima 1.0 NaN NaN NaN
2 Un homme NaN Entre 25 et 34 ans Khenifra Marié et j'ai des enfants à charge Niveau lycée Je suis journalier, je travaille de temps à a... NaN Je préfére ne pas répondre Non Télévision NaN Je suis d'accord Je suis d'accord True False True False True False False False False False False False False NaN Oui Non Non NaN Je vais voter en 2021 mais je ne sais toujours... J'accepter d'être recontacté Arabe Beni Mellal-Khenifra 5.0 Na veut pas répondre 2.0 NaN
My attempt
Following this guide on categorical encoding I tried the following:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Replace every object-dtype column with its integer category codes
for column in df1.columns:
    if df1[column].dtype == object:
        df1[column] = df1[column].astype('category')
        df1[column] = df1[column].cat.codes
# Pearson correlation of the encoded dataframe
cor = df1.corr()
# Mask the upper triangle so each pair is shown only once
mask = np.zeros_like(cor, dtype=bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(12, 10))
sns.heatmap(cor,
            vmin=-1,
            cmap='coolwarm',
            annot=False,
            mask=mask)
I guess this doesn't make sense, because I'm computing a Pearson correlation between categorical variables, or between numerical and categorical variables.
For correlation between categorical values you can use the corrected Cramer's V, and for correlation between numerical and categorical variables you can use the correlation ratio.
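Here is a rough sketch of both measures, written with only NumPy and pandas. The function names `cramers_v_corrected` and `correlation_ratio` are my own, the toy data is invented (it only borrows three column names from the question), and the Cramér's V formula is the usual bias-corrected variant, so treat this as an illustration under those assumptions rather than a reference implementation.

```python
import numpy as np
import pandas as pd


def cramers_v_corrected(x, y):
    """Bias-corrected Cramér's V between two categorical Series (0 to 1)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence, then the chi-squared statistic.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    r, k = observed.shape
    phi2_corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return float(np.sqrt(phi2_corr / denom)) if denom > 0 else 0.0


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical variable."""
    data = pd.DataFrame({"cat": categories, "val": values}).dropna()
    grand_mean = data["val"].mean()
    groups = data.groupby("cat")["val"]
    # Between-group sum of squares divided by the total sum of squares.
    between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    total = ((data["val"] - grand_mean) ** 2).sum()
    return float(np.sqrt(between / total)) if total > 0 else 0.0


# Invented toy data in which each variable fully determines the others.
toy = pd.DataFrame({
    "Sexe": ["Une femme", "Un homme"] * 10,
    "Si_vote_2016": ["Oui", "Non"] * 10,
    "nbr_app": [3.0, 1.0] * 10,
})
v = cramers_v_corrected(toy["Sexe"], toy["Si_vote_2016"])
eta = correlation_ratio(toy["Sexe"], toy["nbr_app"])
print(v, eta)
```

Both values range from 0 (no association) to 1 (one variable fully determines the other). Unlike Pearson's coefficient they carry no sign, so they could fill a heatmap but only with a 0 to 1 colour scale.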
In Python 3 with pandas, I have a list of dictionaries in this format:
a = [{'texto27/2': 'SENADO: PLS 00143/2016, de autoria de Telmário Mota, fala sobre maternidade e sofreu alterações em sua tramitação. Tramitação: Comissão de Assuntos Sociais. Situação: PRONTA PARA A PAUTA NA COMISSÃO. http://legis.senado.leg.br/sdleg-getter/documento?dm=2914881'}, {'texto27/3': 'SENADO: PEC 00176/2019, de autoria de Randolfe Rodrigues, fala sobre maternidade e sofreu alterações em sua tramitação. Tramitação: Comissão de Constituição, Justiça e Cidadania. Situação: PRONTA PARA A PAUTA NA COMISSÃO. http://legis.senado.leg.br/sdleg-getter/documento?dm=8027142'}, {'texto6/4': 'SENADO: PL 05643/2019, de autoria de Câmara dos Deputados, fala sobre violência sexual e sofreu alterações em sua tramitação. Tramitação: Comissão de Direitos Humanos e Legislação Participativa. Situação: MATÉRIA COM A RELATORIA. http://legis.senado.leg.br/sdleg-getter/documento?dm=8015569'}]
I tried to transform it into a dataframe with these commands:
import pandas as pd
df_lista_sentencas = pd.DataFrame(a)
df_lista_sentencas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
texto27/2 1 non-null object
texto27/3 1 non-null object
texto6/4 1 non-null object
dtypes: object(3)
memory usage: 100.0+ bytes
But the generated dataframe has blank lines:
df_lista_sentencas.reset_index()
index texto27/2 texto27/3 texto6/4
0 0 SENADO: PLS 00143/2016, de autoria de Telmário... NaN NaN
1 1 NaN SENADO: PEC 00176/2019, de autoria de Randolfe... NaN
2 2 NaN NaN SENADO: PL 05643/2019, de autoria de Câmara do...
I would like to generate something like this:
texto27/2 texto27/3 texto6/4
SENADO: PLS 00143/2016, de autoria de Telmário... SENADO: PEC 00176/2019, de autoria de Randolfe.. SENADO: PL 05643/2019, de autoria de Câmara do...
Please, does anyone know how I can create a dataframe without blank lines?
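Maybe use bfill, which fills each NaN with the next valid value below it in the same column, and then keep only the first row: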
df = df_lista_sentencas.bfill().iloc[[0]]
print(df)
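Another option, if every dictionary in the list has a different key as in the question, might be to merge the dictionaries before building the DataFrame, so the NaN values never appear in the first place. A minimal sketch, with shortened stand-ins for the long strings:

```python
import pandas as pd

# Shortened stand-ins for the long texts in the question.
a = [
    {"texto27/2": "SENADO: PLS 00143/2016 ..."},
    {"texto27/3": "SENADO: PEC 00176/2019 ..."},
    {"texto6/4": "SENADO: PL 05643/2019 ..."},
]

# Merge the single-key dictionaries into one dictionary.
merged = {}
for item in a:
    merged.update(item)

# A list containing one dictionary yields a DataFrame with exactly one row.
df_one_row = pd.DataFrame([merged])
print(df_one_row)
```

Note that `update` overwrites earlier values when a key repeats, so this only matches the bfill result while the keys are unique.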
I'm trying to scrape house prices from this link: https://www.bienici.com/recherche/achat/france?page=2
I need to know what's wrong with my program.
My program:
from bs4 import BeautifulSoup
import requests
import csv
with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow("Prix")
    for i in range(1, 20):
        url = "https://www.bienici.com/recherche/achat/france?page=%s" % i
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        data = soup.find(class_="resultsListContainer")
        for data in data.find_all(class_="sideListItemContainerInTableForListInResult"):
            prix = data.find("span", {"class": "thePrice"})
            prix = prix.text if prix else ""
            writer.writerow(prix)
I get this error :
Traceback (most recent call last):
  File "1.py", line 16, in <module>
    for data in data.find_all(class_="sideListItemContainerInTableForListInResult"):
AttributeError: 'NoneType' object has no attribute 'find_all'
I think my error is in class_="sideListItemContainerInTableForListInResult", but when I inspect the HTML it looks like the right one!
It looks like you can access the data through the JSON response. You'll have to play around with the parameters and dig through the JSON structure to pull out what you need, but it looks like there is quite a bit of data you can grab:
import requests
payload = {'filters': '{"size":24,"from":0,"filterType":"buy","propertyType":["house","flat"],"newProperty":false,"page":2,"resultsPerPage":24,"maxAuthorizedResults":2400,"sortBy":"relevance","sortOrder":"desc","onTheMarket":[true],"limit":"ih{eIzjhZ?q}qrAzaf}AlrD?rvfrA","showAllModels":false,"blurInfoType":["disk","exact"]}'}
url = 'https://www.bienici.com/realEstateAds.json'
response = requests.get(url, params=payload).json()
for prop in response['realEstateAds']:
    title = prop['title']
    city = prop['city']
    desc = prop['description']
    price = prop['price']
    print('%s - %s\n%s\n%s' % (price, title, city, desc))
Output:
1190000 - BALMA - Bien d'exception 12 pièces 400 m2
Pin-Balma
- EXCLUSIVITÉ - Bien d'exception à BALMA (31130) . - 400 m2 - 12 pièces - 7 chambres - Sur 3000 m2 de terrain. . Luxe, calme et volupté dans cette magnifique et très rare maison divisée en deux lots d'habitation communicants (qui peuvent aussi être indépendants). L'un de 260 m2 et l'autre de 140 m2. Chacun avec son entrée, son séjour, sa cuisine, ses salles de bains et pièces d'eau, ses chambres, ses terrasses et ses équipements de qualité. Et ce, ouvrant le champ des possibles quant aux projets potentiels !. . Cette bâtisse de prestige du milieu du XVIIIème siècle a vu ses rénovations et prestations inhérentes réalisées avec des matériaux et des façons d'excellence.. . Sur les 3000 m2 de terrain, un jardin paysager orne les abords de la maison et de la piscine. Puis, vous trouverez un pré et un bois privé qui réveilleront vos aspirations bucoliques. Vous pourrez ainsi vous blottir dans un écrin précieux niché à proximité de TOULOUSE.. . Vos hôtes et vous serez à proximité des commodités, des transports (dont le métro), des cliniques, des établissements scolaires et des hypermarchés ; et tout aussi proches d'Airbus (BLAGNAC et Défense), du CEAT, des SSII, de Orange Business Services, etc.. . Recevez notre invitation au voyage, là où tout n'est qu'ordre et beauté, Luxe, calme et volupté.. . Visite virtuelle disponible en agence ou en LiveRoom avec un de nos conseillers.
To get to a CSV, you'll need to convert to a dataframe. The JSON structure is nested, so some columns won't be entirely flattened out. There are ways to handle that, but to get a basic dataframe:
import pandas as pd
# pd.json_normalize is the top-level name for this function since pandas 1.0
df = pd.json_normalize(response['realEstateAds'])
Output:
print (df.to_string())
accountType adCreatedByPro adType adTypeFR addressKnown agencyFeePercentage agencyFeeUrl annualCondominiumFees availableDate balconyQuantity balconySurfaceArea bathroomsQuantity bedroomsQuantity blurInfo.bbox blurInfo.centroid.lat blurInfo.centroid.lon blurInfo.origin blurInfo.position.lat blurInfo.position.lon blurInfo.radius blurInfo.type city condominiumPartsQuantity description descriptionTextLength district.code_insee district.cp district.id district.id_polygone district.id_type district.insee_code district.libelle district.name district.postal_code district.type_id enclosedParkingQuantity endOfPromotedAsExclusive energyClassification energyValue exposition feesChargedTo floor floorQuantity greenhouseGazClassification greenhouseGazValue hasAirConditioning hasAlarm hasBalcony hasCaretaker hasCellar hasDoorCode hasElevator hasFirePlace hasGarden hasIntercom hasPool hasSeparateToilet hasTerrace heating highlightMailContact id isCalm isCondominiumInProcedure isDisabledPeopleFriendly isExclusiveSaleMandate isInCondominium isRefurbished isStudio landSurfaceArea modificationDate needHomeStaging newOrOld newProperty nothingBehindForm parkingPlacesQuantity photos postalCode price priceHasDecreased pricePerSquareMeter priceWithoutFees propertyType publicationDate reference roomsQuantity showerRoomsQuantity status.autoImported status.closedByUser status.highlighted status.is3dHighlighted status.isLeading status.onTheMarket surfaceArea terracesQuantity thresholdDate title toiletQuantity transactionType userRelativeData.accountIds userRelativeData.canChangeOnTheMarket userRelativeData.canModifyAd userRelativeData.canModifyAdBlur userRelativeData.canSeeAddress userRelativeData.canSeeContacts userRelativeData.canSeeExactPosition userRelativeData.canSeePublicationCertificateHtml userRelativeData.canSeePublicationCertificatePdf userRelativeData.canSeeRealDates userRelativeData.canSeeStats userRelativeData.importAccountId userRelativeData.isAdModifier userRelativeData.isAdmin 
userRelativeData.isFavorite userRelativeData.isNetwork userRelativeData.isOwner userRelativeData.searchAccountIds virtualTours with360 with3dModel workToDo yearOfConstruction
0 agency True buy vente True NaN https://www.immoceros.fr/mentions-legales-hono... 2517.0 NaN NaN NaN 1.0 3 [2.27006, 48.92827, 2.27006, 48.92827] 48.928270 2.270060 manual 48.928270 2.270060 NaN exact Colombes 47.0 COLOMBES | Agent-Sarre - Champarons |\r\nSitué... 1149 92025 92700 100331 100331 1 92025 Fossés Jean Bouvier Colombes - Fossés Jean Bouvier 92700 1 NaN 0 D 197.00 NaN seller 5.0 6.0 B 9.00 NaN NaN False NaN False NaN True NaN False NaN NaN NaN True électricité individuel False ag922079-195213238 NaN NaN NaN True True NaN NaN NaN 2019-05-11T08:53:15.943Z NaN ancien False True 2.0 [{'url_photo': 'http://photos.ubiflow.net/9220... 92700 469000 False 5097.826087 469000.0 flat 2019-04-23T18:42:59.742Z VA1952-IMMOCEROS2 4 1.0 True False False False True True 92.00 NaN NaN COLOMBES | APPARTEMENT A VENDRE | 4 PIECES - ... 1.0 buy [ubiflow-easybusiness-ag922079] False False False False False False False False False False 558bbfd06fbf04e50075bbce False False False False False [ubiflow-easybusiness-ag922079, contract-type-... [{'originalUrl': 'https://www.nodalview.com/bK... True False NaN NaN
1 agency True buy vente True NaN NaN NaN NaN NaN NaN NaN 1 [7.251231, 43.700846999999996, 7.251231, 43.70... 43.700847 7.251231 custom 43.700847 7.251231 NaN exact Nice NaN A vendre à Nice dans le quartier Grosso / Tzar... 1542 06088 06000 300102 300102 1 06088 Parc Impérial - Le Piol Nice - Parc Impérial - Le Piol 06000 1 NaN 0 E 270.00 Ouest NaN NaN 7.0 C 14.00 NaN NaN NaN NaN NaN NaN True NaN NaN True NaN NaN NaN Individuel False apimo-2871096 NaN NaN NaN True NaN NaN NaN NaN 2019-04-25T17:04:27.723Z NaN NaN False True NaN [{'url_photo': 'https://d1qfj231ug7wdu.cloudfr... 06000 215000 False 4383.282365 NaN flat 2019-04-04T18:04:38.323Z 1508 2 NaN True False False False False True 49.05 NaN NaN Nice François Grosso : F2 dernier étage, terra... NaN buy [apimo-3120] False False False False False False False False False False 5913331e150de0009ce38406 False False False False False [apimo-3120, contract-type-basic, 5913331e150d... [{'originalUrl': 'https://www.nodalview.com/PX... True False False NaN
2 agency True buy vente True NaN NaN NaN NaN NaN NaN NaN 0 [7.2526839999999995, 43.69589099999998, 7.2526... 43.695891 7.252684 custom 43.695891 7.252684 NaN exact Nice NaN Joli studio entièrement meublé et équipé à ven... 1205 06088 06000 300070 300070 1 06088 Gambetta Nice - Gambetta 06000 1 NaN 0 D 224.47 Est NaN 2.0 6.0 B 8.49 True NaN NaN NaN NaN NaN True NaN NaN True NaN NaN NaN Individuel False apimo-1008273 NaN NaN NaN True NaN NaN True NaN 2019-04-24T17:02:19.834Z NaN NaN False True NaN [{'url_photo': 'https://d1qfj231ug7wdu.cloudfr... 06000 145000 False 6722.299490 NaN flat 1970-01-01T00:00:00.000Z 1496 1 NaN True False False False False True 21.57 NaN 2019-03-29T09:52:38.387Z Nice proche mer : studio meublé avec balcon NaN buy [apimo-3120] False False False False False False False False False False 5913331e150de0009ce38406 False False False False False [apimo-3120, contract-type-basic, 5913331e150d... [{'originalUrl': 'https://www.nodalview.com/xV... True False False NaN
3 agency True buy vente True NaN http://www.willman.fr/i/redac/honoraires?honof... NaN 2017-12-27T00:00:00.000Z NaN NaN 3.0 3 [7.165397, 43.666337999999996, 7.165397, 43.66... 43.666338 7.165397 accounts 43.666338 7.165397 NaN exact Cagnes-sur-Mer NaN Située dans le quartier recherché des Bréguièr... 1229 0
Then to save it:
df.to_csv('file.csv', index=False)
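To see how the flattening behaves without calling the site, here is a small offline sketch. The two records are heavily trimmed stand-ins built from a few values visible in the output above (`price`, `city`, `blurInfo`); the real response has many more fields, so this is only an illustration of how nested keys turn into dotted column names.

```python
import pandas as pd

# Trimmed, illustrative records; the real ads contain many more fields.
sample_ads = [
    {
        "price": 469000,
        "city": "Colombes",
        "blurInfo": {"type": "exact", "position": {"lat": 48.92827, "lon": 2.27006}},
    },
    {
        "price": 215000,
        "city": "Nice",
        "blurInfo": {"type": "exact", "position": {"lat": 43.700847, "lon": 7.251231}},
    },
]

# Nested dictionaries become columns such as "blurInfo.position.lat".
df_sample = pd.json_normalize(sample_ads)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df_sample.to_csv(index=False)
print(csv_text)
```

Values that are lists (such as `photos` in the real data) stay as lists inside a single cell, which is why some columns are not fully flattened.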
The problem here is that data is None: data = soup.find(class_="resultsListContainer") finds no matching element and returns None, so calling find_all on it in the for loop fails.
I don't know enough about the exact problem you're trying to solve to tell whether this is a problem with your code or whether the website sometimes has nothing with the "resultsListContainer" class in the HTML it returns. If the element is sometimes missing, you can check that data is not None before reaching the for loop.
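A minimal sketch of such a check, run on an inline HTML string so it works offline. The helper names `extract_prices` and `write_prices` and the sample HTML are made up for illustration; only the class names come from the question. It also passes lists to `writerow`, because `writerow("Prix")` would write each character of the string as its own column.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up HTML that reuses the class names from the question.
SAMPLE_HTML = """
<div class="resultsListContainer">
  <div class="sideListItemContainerInTableForListInResult"><span class="thePrice">469000</span></div>
  <div class="sideListItemContainerInTableForListInResult"><span class="thePrice">215000</span></div>
</div>
"""


def extract_prices(html):
    """Return the price texts found in html, or an empty list if the container is missing."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_="resultsListContainer")
    if container is None:  # guard: the element is not in this HTML
        return []
    prices = []
    for item in container.find_all(class_="sideListItemContainerInTableForListInResult"):
        price = item.find("span", {"class": "thePrice"})
        prices.append(price.text.strip() if price else "")
    return prices


def write_prices(prices, handle):
    """Write a header and one price per row to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["Prix"])
    for price in prices:
        writer.writerow([price])


buffer = io.StringIO()
write_prices(extract_prices(SAMPLE_HTML), buffer)
print(buffer.getvalue())
```

With this guard the script skips pages where the container is absent instead of crashing. If every page comes back empty, the listings are probably not present in the HTML that requests receives at all, and a different data source would be needed.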