AttributeError: 'str' object has no attribute 'values' (compare dataframes) - python

I am a beginner at Python, and I have some issues with code I wrote.
I have 2 dataframes: one with general information about books (dfMediaGe), and the other with the names of books that were shown on TV (dfTV).
My goal is to compare them, and to fill the column 'At least 1 TV emission' in dfMediaGe with a 1 if the book appears in the dfTV dataframe.
My difficulty is that the dataframes do not have the same number of rows/columns.
Here is a sample of dfMediaGe:
Titre original_title AUTEUR DATE EDITEUR THEMESIMPLE THEME GENRE rating rating_average ... current_count done_count list_count recommend_count review_count TRADUITDE LANGUEECRITURE NOTE At least 1 TV emission id
0 La souris des dents NaN Roger, Marie-Sabine|Desbons, Marie 01/01/2021 Lito TIPJJ001 Eveil J000100 Jeunesse - Eveil et Fiction / Histoire... GJEU003 Jeunesse / Mini albums|GJEU013 Jeuness... NaN NaN ... 0.0 0.0 0.0 0.0 0.0 NaN fre NaN 0 46220676.0
1 La petite mare du grand crocodile NaN Buteau, Gaëlle|Hudrisier, Cécile 01/01/2021 Lito TIPJJ001 Eveil J000100 Jeunesse - Eveil et Fiction / Histoire... GJEU003 Jeunesse / Mini albums|GJEU013 Jeuness... NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 NaN fre NaN 46220678.0
And here is a sample of dfTV:
Titre AUTEUR DATE EDITEUR GENRE THEMESIMPLE TRADUITDE NOTE THEME LANGUEECRITURE FORMATNUMERIQUE PUBLIC MATIERE LEXIQUE DESCRIPTION
0 Les strates Bagieu, Pénélope 11/12/2021 Gallimard NaN TIPBD001 Albums NaN NaN T090200 Bandes dessinées / Bandes dessinées fre NaN NaN NaN NaN 1 vol. ; illustrations en noir et blanc ; 24 x...
And here is the code I wrote, which is not working at all.
for Titre, r in dfMediaGe.iterrows():
    for Titre, r in dfTV.iterrows():
        p = 0
        if r['Titre'].values == (dfTV['Titre']).values.any():
            p = 1
        r['Au moins 1 passage TV'].append(p)
I get this error:
AttributeError: 'str' object has no attribute 'values'
Thank you very much for your help!

I don't think your two data frames having different numbers of columns is a problem. The error itself comes from r['Titre'] being a plain string (iterrows yields the scalar cell values of each row), and a string has no .values attribute.
You can achieve what you are looking for using this:
import pandas as pd

data_dfMediaGe = [
    ['Les strates Bagieu'],
    ['La petite mare du grand crocodile'],
    ['La souris des dents NaN Roger'],
    ['Movie XYZ']
]
dfMediaGe = pd.DataFrame(data=data_dfMediaGe, columns=['Titre'])
dfMediaGe['Au moins 1 passage TV'] = 0

data_dfTV = [
    ['La petite mare du grand crocodile'],
    ['Movie XYZ']
]
dfTV = pd.DataFrame(data=data_dfTV, columns=['Titre'])

for i, row in dfMediaGe.iterrows():
    if row['Titre'] in list(dfTV['Titre']):
        dfMediaGe.at[i, 'Au moins 1 passage TV'] = 1

print(dfMediaGe)
Titre Au moins 1 passage TV
0 Les strates Bagieu 0
1 La petite mare du grand crocodile 1
2 La souris des dents NaN Roger 0
3 Movie XYZ 1
All you have to do is iterate through the rows in dfMediaGe and check whether the value in the Titre column is present in the Titre column of dfTV.
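As a side note, the same result can be obtained without an explicit loop; a minimal sketch using isin (a vectorized membership test):

# Vectorized alternative: flag titles that appear anywhere in dfTV['Titre']
dfMediaGe['Au moins 1 passage TV'] = dfMediaGe['Titre'].isin(dfTV['Titre']).astype(int)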


Normalize values according to category in pandas

My dataframe looks like this
# pubdate lang Count
# 0 20140619 en 3
# 1 20150308 en 1
# 2 20140207 en 1
# 3 20180319 en 1
# 4 20150223 en 1
I would like to normalize the count by replacing it with itself divided by the average count/year.
For example, here is how the table looks at the end
# pubdate lang Count
# 0 20140619 en 1.5
# 1 20150308 en 1
# 2 20140207 en 0.5
# 3 20180319 en 1
# 4 20150223 en 1
So for year 2014, the average of all counts was (3+1)/(2 rows of 2014) = 2, then each value was divided by it.
I thought about first duplicating the frame, adding a year column, grouping per year, and then changing the values in the first table accordingly, but I do not know how to do it in code.
You can also calculate it step by step, keeping all the temporary calculations.
First convert pubdate to datetime and extract the year:
import pandas as pd

df['pubdate_year'] = pd.to_datetime(df.pubdate, format='%Y%m%d').dt.to_period('Y')
then group by year and calculate the mean of Count:
df['year_mean'] = df.groupby(['pubdate_year']).Count.transform('mean')
finally just divide columns:
df['normalized_count'] = df['Count'] / df['year_mean']
The result contains all the calculation steps:
pubdate lang Count pubdate_year year_mean normalized_count
0 20140619 en 3 2014 2 1.5
1 20150308 en 1 2015 1 1.0
2 20140207 en 1 2014 2 0.5
3 20180319 en 1 2018 1 1.0
4 20150223 en 1 2015 1 1.0
If you don't need to keep the temporary calculations:
df = df.drop(columns=['Count','pubdate_year','year_mean']).rename(columns={'normalized_count':'Count'})
pubdate lang Count
0 20140619 en 1.5
1 20150308 en 1.0
2 20140207 en 0.5
3 20180319 en 1.0
4 20150223 en 1.0
Here's an approach that slices out the year substring and uses it as a grouper, then divides Count by the group mean, using transform to preserve the shape of the dataframe:
df['Count'] /= df.groupby(df.pubdate.astype(str).str[:4]).Count.transform('mean')
print(df)
pubdate lang Count
0 20140619 en 1.5
1 20150308 en 1.0
2 20140207 en 0.5
3 20180319 en 1.0
4 20150223 en 1.0
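For contrast, a minimal sketch (rebuilding the sample frame) of why transform is used here: agg collapses to one row per year, while transform broadcasts the per-year mean back onto every original row, which is what lets the division line up element-wise.

import pandas as pd

df = pd.DataFrame({'pubdate': [20140619, 20150308, 20140207, 20180319, 20150223],
                   'lang': ['en'] * 5,
                   'Count': [3, 1, 1, 1, 1]})
year = df.pubdate.astype(str).str[:4]
print(df.groupby(year).Count.agg('mean'))        # one row per year: 2014, 2015, 2018
print(df.groupby(year).Count.transform('mean'))  # one value per original row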

how to concatenate names in a DF with surnames when there are NaN entries

Well, I have this DF in Python:
folio id_incidente nombre app apm \
0 1 1 SIN DATOS SIN DATOS SIN DATOS
1 131 100085 JUAN DOMINGO GONZALEZ DELGADO
2 132 100085 FRANCISCO JAVIER VELA RAMIREZ
3 133 100087 JUAN CARLOS PEREZ MEDINA
4 134 100088 ARMANDO SALINAS SALINAS
... ... ... ... ... ...
1169697 1223258 866846 IVAN RIVERA SILVA
1169698 1223259 866847 EDUARDO PLASCENCIA MARTINEZ
1169699 1223260 866848 FRANCISCO JAVIER PLASCENCIA MARTINEZ
1169700 1223261 866849 JUAN ALBERTO MARTINEZ ARELLANO
1169701 1223262 866850 JOSE DE JESUS SERRANO GONZALEZ
foto_barandilla fecha_hora_registro
0 1.jpg 0/0/0000 00:00:00
1 131.jpg 2008-08-07 15:42:25
2 132.jpg 2008-08-07 15:50:42
3 133.jpg 2008-08-07 16:37:24
4 134.jpg 2008-08-07 17:18:12
... ... ...
1169697 20200330103123_239288573.jpg 2020-03-30 10:32:10
1169698 20200330103726_1160992585.jpg 2020-03-30 10:38:25
1169699 20200330103837_999151106.jpg 2020-03-30 10:39:44
1169700 20200330104038_29275767.jpg 2020-03-30 10:41:52
1169701 20200330104145_640780023.jpg 2020-03-30 10:45:35
Here, app and apm are the paternal and maternal surnames. I tried this in order to get another column with the whole name:
names = {}
for i in range(1, df.shape[0] + 1):
    try:
        names[i] = df["nombre"].iloc[i] + ' ' + df["app"].iloc[i] + ' ' + df["apm"].iloc[i]
    except:
        print(df["folio"].iloc[i], df["nombre"].iloc[i], df["app"].iloc[i], df["apm"].iloc[i])
but I get this:
400085 nan nan nan
400631 nan nan nan
401267 nan nan nan
401933 nan nan nan
401942 nan nan nan
402030 nan nan nan
403008 nan nan nan
403010 nan nan nan
403011 nan nan nan
403027 nan nan nan
403384 nan nan nan
403399 nan nan nan
403415 nan nan nan
403430 nan nan nan
404764 nan nan nan
501483 CARLOS ESPINOZA nan
504723 RICARDO JARED LOPEZ ACOSTA nan
506989 JUAN JOSE FLORES OCHOA nan
507376 JOSE DE JESUS VENEGAS nan
.....
I tried to use .fillna('') like this:
df["app"].fillna('')
df["apm"].fillna('')
df["nombre"].fillna('')
but the result is the same. I hope you can help me build the column with the whole name, like name + surname1 + surname2.
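A note on why that attempt changes nothing: fillna returns a new Series rather than filling in place, so the result has to be assigned back:

df["app"] = df["app"].fillna('')
df["apm"] = df["apm"].fillna('')
df["nombre"] = df["nombre"].fillna('')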
Edit: here is my minimal version. The reporte files are each one part of the whole database, as shown above:
for i in range(1, 31):
    exec('reporte_%d = pd.read_excel("/home/workstation/Desktop/fotos/Fotos/Detenidos/Reporte Detenidos CER %d.xlsx", encoding="latin1")' % (i, i))
reportes = [reporte_1,reporte_2,reporte_3,reporte_4,reporte_5,reporte_6,reporte_7,reporte_8,reporte_9,reporte_10,reporte_11,reporte_12,reporte_13,reporte_14,reporte_15,reporte_16,reporte_17,reporte_18,reporte_19,reporte_20,reporte_21,reporte_22,reporte_23,reporte_24,reporte_25,reporte_26,reporte_27,reporte_28,reporte_29,reporte_30]
df = pd.concat(reportes)
Now when I run
df['Full_name'] = [' '.join([y for y in x if pd.notna(y)]) for x in zip(df['nombre'], df['app'], df['apm'])]
I get this error: TypeError: sequence item 1: expected str instance, int found
You can ' '.join all the words after removing the null values. It's a string operation, and apply(axis=1) gets slow, so we use a list comprehension instead:
Sample Data
nombre app apm
0 Mr. blah bar
1 blah blah foo
2 NaN NaN NaN
3 blah Mr. bar
4 blah foo Mr.
5 foo Mr. blah
6 NaN Mr. foo
7 blah Mr. NaN
8 NaN bar bar
9 foo Mr. Mr.
Code
df['Full_name'] = [' '.join([y for y in x if pd.notna(y)])
                   for x in zip(df['nombre'], df['app'], df['apm'])]
# nombre app apm Full_name
#0 Mr. blah bar Mr. blah bar
#1 blah blah foo blah blah foo
#2 NaN NaN NaN # value is the empty string `''`
#3 blah Mr. bar blah Mr. bar
#4 blah foo Mr. blah foo Mr.
#5 foo Mr. blah foo Mr. blah
#6 NaN Mr. foo Mr. foo
#7 blah Mr. NaN blah Mr.
#8 NaN bar bar bar bar
#9 foo Mr. Mr. foo Mr. Mr.
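Regarding the TypeError from the edit (some cells hold ints, which ' '.join cannot handle), a small variant casting each value to str should work:

df['Full_name'] = [' '.join(str(y) for y in x if pd.notna(y))
                   for x in zip(df['nombre'], df['app'], df['apm'])]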
You want to keep your processing within pandas as much as possible. By building a Python dictionary of strings, you explode your memory footprint and defeat the purpose of using pandas in the first place. You can use the pandas str.cat method to put the strings together, so nominally it's just:
df["Concatenated"] = df["nombre"].str.cat([df["app"], df["apm"]], sep=" ")
But it sounds like your dataframe needs a bit of cleanup first. For instance, what is that "foto_barandilla fecha_hora_registro" stuff halfway down? Here is a fully worked example of cleaning the dataframe and concatenating:
import pandas as pd
import re

data = """folio  id_incidente  nombre  app  apm
1  1  SIN DATOS  SIN DATOS  SIN DATOS
131  100085  JUAN DOMINGO  GONZALEZ  DELGADO
132  100085  FRANCISCO JAVIER  VELA  RAMIREZ
133  100087  JUAN CARLOS  PEREZ  MEDINA
134  100088  ARMANDO  SALINAS  SALINAS
1223258  866846  IVAN  RIVERA  SILVA
1223259  866847  EDUARDO  PLASCENCIA  MARTINEZ
1223260  866848  FRANCISCO JAVIER  PLASCENCIA  MARTINEZ
1223261  866849  JUAN ALBERTO  MARTINEZ  ARELLANO
1223262  866850  JOSE DE JESUS  SERRANO  GONZALEZ"""

# make a test dataframe (fields are separated by two or more spaces)
table = []
for line in data.split("\n"):
    line = line.strip()
    table.append(re.split(r"\s{2,}", line))
df = pd.DataFrame(table[1:], columns=table[0])

# ensure data types and scrub the data
df = df.astype(
    {"folio": int, "id_incidente": int, "nombre": "string",
     "app": "string", "apm": "string"}, errors="ignore")
df.update(df[["nombre", "app", "apm"]].fillna(" "))

# build the new column
df["Concatenated"] = df["nombre"].str.cat([df["app"], df["apm"]], sep=" ")
print(df)

# ... or, if you don't want to scrub the dataframe first
df["Concatenated"] = df["nombre"].fillna(" ").str.cat(
    [df["app"].fillna(" "), df["apm"].fillna(" ")], sep=" ")
print("================================================")
print(df)
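For what it's worth, str.cat also takes an na_rep argument, which lets you skip the fillna scrubbing entirely; a sketch:

df["Concatenated"] = df["nombre"].str.cat([df["app"], df["apm"]], sep=" ", na_rep="")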

AttributeError: 'NoneType' object has no attribute 'find_all' Beautifulsoup wrong class

I'm trying to scrape house prices from this link: https://www.bienici.com/recherche/achat/france?page=2
Can you tell me what's wrong with my program?
My program:
from bs4 import BeautifulSoup
import requests
import csv

with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow("Prix")
    for i in range(1, 20):
        url = "https://www.bienici.com/recherche/achat/france?page=%s" % i
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        data = soup.find(class_="resultsListContainer")
        for data in data.find_all(class_="sideListItemContainerInTableForListInResult"):
            prix = data.find("span", {"class": "thePrice"})
            prix = prix.text if prix else ""
            writer.writerow(prix)
I get this error:
Traceback (most recent call last):
File "1.py", line 16, in <module>
for data in data.find_all(class_="sideListItemContainerInTableForListInResult"):
AttributeError: 'NoneType' object has no attribute 'find_all'
I think my error is in class_="sideListItemContainerInTableForListInResult", but when I inspect the HTML it looks like the right class!
It looks like you can access the data through the JSON response. You'll have to play around with the parameters and dig through the JSON structure to pull out what you need, but it looks like there's quite a bit of data you can grab:
import requests

payload = {'filters': '{"size":24,"from":0,"filterType":"buy","propertyType":["house","flat"],"newProperty":false,"page":2,"resultsPerPage":24,"maxAuthorizedResults":2400,"sortBy":"relevance","sortOrder":"desc","onTheMarket":[true],"limit":"ih{eIzjhZ?q}qrAzaf}AlrD?rvfrA","showAllModels":false,"blurInfoType":["disk","exact"]}'}
url = 'https://www.bienici.com/realEstateAds.json'
response = requests.get(url, params=payload).json()

for prop in response['realEstateAds']:
    title = prop['title']
    city = prop['city']
    desc = prop['description']
    price = prop['price']
    print('%s - %s\n%s\n%s' % (price, title, city, desc))
Output:
print ('%s - %s\n%s\n%s' %(price, title, city, desc))
1190000 - BALMA - Bien d'exception 12 pièces 400 m2
Pin-Balma
- EXCLUSIVITÉ - Bien d'exception à BALMA (31130) . - 400 m2 - 12 pièces - 7 chambres - Sur 3000 m2 de terrain. . Luxe, calme et volupté dans cette magnifique et très rare maison divisée en deux lots d'habitation communicants (qui peuvent aussi être indépendants). L'un de 260 m2 et l'autre de 140 m2. Chacun avec son entrée, son séjour, sa cuisine, ses salles de bains et pièces d'eau, ses chambres, ses terrasses et ses équipements de qualité. Et ce, ouvrant le champ des possibles quant aux projets potentiels !. . Cette bâtisse de prestige du milieu du XVIIIème siècle a vu ses rénovations et prestations inhérentes réalisées avec des matériaux et des façons d'excellence.. . Sur les 3000 m2 de terrain, un jardin paysager orne les abords de la maison et de la piscine. Puis, vous trouverez un pré et un bois privé qui réveilleront vos aspirations bucoliques. Vous pourrez ainsi vous blottir dans un écrin précieux niché à proximité de TOULOUSE.. . Vos hôtes et vous serez à proximité des commodités, des transports (dont le métro), des cliniques, des établissements scolaires et des hypermarchés ; et tout aussi proches d'Airbus (BLAGNAC et Défense), du CEAT, des SSII, de Orange Business Services, etc.. . Recevez notre invitation au voyage, là où tout n'est qu'ordre et beauté, Luxe, calme et volupté.. . Visite virtuelle disponible en agence ou en LiveRoom avec un de nos conseillers.
To get to a CSV, you'll need to convert to a dataframe. Now, the JSON structure is nested, so some columns won't be entirely flattened out. There are ways to handle that, but to get a basic dataframe:

from pandas.io.json import json_normalize  # in pandas >= 1.0, use pd.json_normalize instead

df = json_normalize(response['realEstateAds'])
Output:
print (df.to_string())
accountType adCreatedByPro adType adTypeFR addressKnown agencyFeePercentage agencyFeeUrl annualCondominiumFees availableDate balconyQuantity balconySurfaceArea bathroomsQuantity bedroomsQuantity blurInfo.bbox blurInfo.centroid.lat blurInfo.centroid.lon blurInfo.origin blurInfo.position.lat blurInfo.position.lon blurInfo.radius blurInfo.type city condominiumPartsQuantity description descriptionTextLength district.code_insee district.cp district.id district.id_polygone district.id_type district.insee_code district.libelle district.name district.postal_code district.type_id enclosedParkingQuantity endOfPromotedAsExclusive energyClassification energyValue exposition feesChargedTo floor floorQuantity greenhouseGazClassification greenhouseGazValue hasAirConditioning hasAlarm hasBalcony hasCaretaker hasCellar hasDoorCode hasElevator hasFirePlace hasGarden hasIntercom hasPool hasSeparateToilet hasTerrace heating highlightMailContact id isCalm isCondominiumInProcedure isDisabledPeopleFriendly isExclusiveSaleMandate isInCondominium isRefurbished isStudio landSurfaceArea modificationDate needHomeStaging newOrOld newProperty nothingBehindForm parkingPlacesQuantity photos postalCode price priceHasDecreased pricePerSquareMeter priceWithoutFees propertyType publicationDate reference roomsQuantity showerRoomsQuantity status.autoImported status.closedByUser status.highlighted status.is3dHighlighted status.isLeading status.onTheMarket surfaceArea terracesQuantity thresholdDate title toiletQuantity transactionType userRelativeData.accountIds userRelativeData.canChangeOnTheMarket userRelativeData.canModifyAd userRelativeData.canModifyAdBlur userRelativeData.canSeeAddress userRelativeData.canSeeContacts userRelativeData.canSeeExactPosition userRelativeData.canSeePublicationCertificateHtml userRelativeData.canSeePublicationCertificatePdf userRelativeData.canSeeRealDates userRelativeData.canSeeStats userRelativeData.importAccountId userRelativeData.isAdModifier userRelativeData.isAdmin userRelativeData.isFavorite userRelativeData.isNetwork userRelativeData.isOwner userRelativeData.searchAccountIds virtualTours with360 with3dModel workToDo yearOfConstruction
0 agency True buy vente True NaN https://www.immoceros.fr/mentions-legales-hono... 2517.0 NaN NaN NaN 1.0 3 [2.27006, 48.92827, 2.27006, 48.92827] 48.928270 2.270060 manual 48.928270 2.270060 NaN exact Colombes 47.0 COLOMBES | Agent-Sarre - Champarons |\r\nSitué... 1149 92025 92700 100331 100331 1 92025 Fossés Jean Bouvier Colombes - Fossés Jean Bouvier 92700 1 NaN 0 D 197.00 NaN seller 5.0 6.0 B 9.00 NaN NaN False NaN False NaN True NaN False NaN NaN NaN True électricité individuel False ag922079-195213238 NaN NaN NaN True True NaN NaN NaN 2019-05-11T08:53:15.943Z NaN ancien False True 2.0 [{'url_photo': 'http://photos.ubiflow.net/9220... 92700 469000 False 5097.826087 469000.0 flat 2019-04-23T18:42:59.742Z VA1952-IMMOCEROS2 4 1.0 True False False False True True 92.00 NaN NaN COLOMBES | APPARTEMENT A VENDRE | 4 PIECES - ... 1.0 buy [ubiflow-easybusiness-ag922079] False False False False False False False False False False 558bbfd06fbf04e50075bbce False False False False False [ubiflow-easybusiness-ag922079, contract-type-... [{'originalUrl': 'https://www.nodalview.com/bK... True False NaN NaN
1 agency True buy vente True NaN NaN NaN NaN NaN NaN NaN 1 [7.251231, 43.700846999999996, 7.251231, 43.70... 43.700847 7.251231 custom 43.700847 7.251231 NaN exact Nice NaN A vendre à Nice dans le quartier Grosso / Tzar... 1542 06088 06000 300102 300102 1 06088 Parc Impérial - Le Piol Nice - Parc Impérial - Le Piol 06000 1 NaN 0 E 270.00 Ouest NaN NaN 7.0 C 14.00 NaN NaN NaN NaN NaN NaN True NaN NaN True NaN NaN NaN Individuel False apimo-2871096 NaN NaN NaN True NaN NaN NaN NaN 2019-04-25T17:04:27.723Z NaN NaN False True NaN [{'url_photo': 'https://d1qfj231ug7wdu.cloudfr... 06000 215000 False 4383.282365 NaN flat 2019-04-04T18:04:38.323Z 1508 2 NaN True False False False False True 49.05 NaN NaN Nice François Grosso : F2 dernier étage, terra... NaN buy [apimo-3120] False False False False False False False False False False 5913331e150de0009ce38406 False False False False False [apimo-3120, contract-type-basic, 5913331e150d... [{'originalUrl': 'https://www.nodalview.com/PX... True False False NaN
2 agency True buy vente True NaN NaN NaN NaN NaN NaN NaN 0 [7.2526839999999995, 43.69589099999998, 7.2526... 43.695891 7.252684 custom 43.695891 7.252684 NaN exact Nice NaN Joli studio entièrement meublé et équipé à ven... 1205 06088 06000 300070 300070 1 06088 Gambetta Nice - Gambetta 06000 1 NaN 0 D 224.47 Est NaN 2.0 6.0 B 8.49 True NaN NaN NaN NaN NaN True NaN NaN True NaN NaN NaN Individuel False apimo-1008273 NaN NaN NaN True NaN NaN True NaN 2019-04-24T17:02:19.834Z NaN NaN False True NaN [{'url_photo': 'https://d1qfj231ug7wdu.cloudfr... 06000 145000 False 6722.299490 NaN flat 1970-01-01T00:00:00.000Z 1496 1 NaN True False False False False True 21.57 NaN 2019-03-29T09:52:38.387Z Nice proche mer : studio meublé avec balcon NaN buy [apimo-3120] False False False False False False False False False False 5913331e150de0009ce38406 False False False False False [apimo-3120, contract-type-basic, 5913331e150d... [{'originalUrl': 'https://www.nodalview.com/xV... True False False NaN
3 agency True buy vente True NaN http://www.willman.fr/i/redac/honoraires?honof... NaN 2017-12-27T00:00:00.000Z NaN NaN 3.0 3 [7.165397, 43.666337999999996, 7.165397, 43.66... 43.666338 7.165397 accounts 43.666338 7.165397 NaN exact Cagnes-sur-Mer NaN Située dans le quartier recherché des Bréguièr... 1229 0
Then to save it:
df.to_csv('file.csv', index=False)
The problem here is that data is None, i.e. data = soup.find(class_="resultsListContainer") is returning None, which means the for loop fails.
I don't know enough about the exact problem you're trying to solve to tell whether this is a bug in your code or whether the website sometimes has nothing in the "resultsListContainer" class. If it is sometimes missing, you can check that data is not None before reaching the for loop.
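A minimal sketch of that guard, dropped into the original page loop (as an aside, csv's writerow expects a sequence, hence the list around the single value):

data = soup.find(class_="resultsListContainer")
if data is None:
    continue  # no results container on this page; skip it
for item in data.find_all(class_="sideListItemContainerInTableForListInResult"):
    prix = item.find("span", {"class": "thePrice"})
    writer.writerow([prix.text if prix else ""])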

Remove duplicated rows in groupby? [duplicate]

I'm trying to create a new column in the dataframe called volume. The DF already consists of other columns like market. What I want to do is to group by price and company and then get their count and add it in a new column called volume. Here's what I have:
df['volume'] = df.groupby(['price', 'company']).transform('count')
This does create a new column; however, it keeps all the rows. I don't need all the rows: I expect 4 rows (one per group), but after the transformation I still get every original row, just with a new column.
market company price volume
LA EK 206.0 2
LA SQ 206.0 1
LA EK 206.0 2
LA EK 36.0 3
LA EK 36.0 3
LA SQ 36.0 1
LA EK 36.0 3
I'd like to drop the duplicated rows. Is there a query that I can do with groupby that will only show the rows like so:
market company price volume
LA EK 206.0 2
LA SQ 206.0 1
LA SQ 36.0 1
LA EK 36.0 3
Simply drop_duplicates with the columns ['market', 'company', 'price']:
>>> df.drop_duplicates(['market', 'company', 'price'])
market company price volume
0 LA EK 206.0 2
1 LA SQ 206.0 1
3 LA EK 36.0 3
5 LA SQ 36.0 1
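As a usage note, drop_duplicates also accepts a keep argument ('first' by default) if you want the last occurrence of each group instead; a sketch:

df.drop_duplicates(subset=['market', 'company', 'price'], keep='last')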
Your data contains duplicates, probably because you are only including a subset of the columns. You need something else in your data other than price (e.g. two different days could close at the same price, but you wouldn't aggregate the volume from the two).
Assuming that the price is unique for a given timestamp, market and company, and that you first sort on your timestamp column if there is one (not required if there is only one price per company and market):
df = pd.DataFrame({
    'company': ['EK', 'SQ', 'EK', 'EK', 'EK', 'SQ', 'EK'],
    'date': ['2018-08-13'] * 3 + ['2018-08-14'] * 4,
    'market': ['LA'] * 7,
    'price': [206] * 3 + [36] * 4})
>>> (df.groupby(['market', 'date', 'company'])['price']
        .agg({'price': 'last', 'volume': 'count'})[['price', 'volume']]
        .reset_index())
market date company price volume
0 LA 2018-08-13 EK 206 2
1 LA 2018-08-13 SQ 206 1
2 LA 2018-08-14 EK 36 3
3 LA 2018-08-14 SQ 36 1
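In recent pandas (0.25+), renaming via a dict passed to agg on a single column was removed; a sketch of the same idea using named aggregation:

out = (df.groupby(['market', 'date', 'company'], as_index=False)
         .agg(price=('price', 'last'), volume=('price', 'count')))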

how to concat values from same column

I want to concatenate two values from the same column. Here is my CSV file:
Date,Region,TemperatureMax,TemperatureMin,PrecipitationMax,PrecipitationMin
01/01/2016,Champagne Ardenne,12,6,2.5,0.3
02/01/2016,Champagne Ardenne,13,9,3.9,0.6
03/01/2016,Champagne Ardenne,14,7,22.5,12.5
01/01/2016,Bourgogne,9,5,0.1,0
02/01/2016,Bourgogne,11,8,16.3,4.2
03/01/2016,Bourgogne,10,5,12.2,6.3
01/01/2016,Pays de la Loire,12,6,2.5,0.3
02/01/2016,Pays de la Loire,13,9,3.9,0.6
03/01/2016,Pays de la Loire,14,7,22.5,12.5
I want to have Bourgogne Champagne Ardenne instead of having them separated, and to calculate the average of TemperatureMax, TemperatureMin, PrecipitationMax and PrecipitationMin:
01/01/2016,Bourgogne Champagne Ardenne,10.5,5.5,1.3,0.15
02/01/2016,Bourgogne Champagne Ardenne,12,8.5,10.1,2.4
03/01/2016,Bourgogne Champagne Ardenne,12,6,17.35,9.4
01/01/2016,Pays de la Loire,12,6,2.5,0.3
02/01/2016,Pays de la Loire,13,9,3.9,0.6
03/01/2016,Pays de la Loire,14,7,22.5,12.5
A more general solution is to first replace the region names via a dict, then groupby and aggregate the mean:
d = {'Champagne Ardenne': 'Bourgogne Champagne Ardenne',
     'Bourgogne': 'Bourgogne Champagne Ardenne'}
df['Region'] = df['Region'].replace(d)

df1 = df.groupby(['Date', 'Region'], as_index=False, sort=False).mean()
print (df1)
Date Region TemperatureMax TemperatureMin \
0 01/01/2016 Bourgogne Champagne Ardenne 10.5 5.5
1 02/01/2016 Bourgogne Champagne Ardenne 12.0 8.5
2 03/01/2016 Bourgogne Champagne Ardenne 12.0 6.0
3 01/01/2016 Pays de la Loire 12.0 6.0
4 02/01/2016 Pays de la Loire 13.0 9.0
5 03/01/2016 Pays de la Loire 14.0 7.0
PrecipitationMax PrecipitationMin
0 1.30 0.15
1 10.10 2.40
2 17.35 9.40
3 2.50 0.30
4 3.90 0.60
5 22.50 12.50
Use groupby's agg method:
df.groupby('Date').agg({
    'Region': lambda g: g.sort_values().str.cat(sep=' '),
    'TemperatureMax': 'mean',
    'TemperatureMin': 'mean',
    'PrecipitationMax': 'mean',
    'PrecipitationMin': 'mean'
})
Note that this concatenates the regions in alphabetical order. Also, grouping by Date alone merges every region for a given date, Pays de la Loire included; if some regions should stay separate, use the replace-then-groupby approach above.
