Problem concatenating and summing a column across two dataframes - python

Here is my problem: I have two dataframes.
I only want to sum the column "Cantidad" because, as you can see, the other information is the same; that column is the only one that will vary.
(Samples below):
First DF
fac
Tarifa Precio Cantidad Importe $ Porcentaje
3 Vecina 155 87 13485 49.2%
2 Misma Zona 130 72 9360 40.7%
0 Alejada 229 17 3893 9.6%
1 Grande 250 1 250 0.6%
Second DF
fac2
Tarifa Precio Cantidad Importe $ Porcentaje
2 Vecina 155 61 9455 55.5%
1 Misma Zona 130 40 5200 36.4%
0 Alejada 229 9 2061 8.2%
I tried this with no luck:
df_concat = pd.concat([fac, fac2], axis=0)
df_grouped = df_concat.groupby(["Tarifa", "Precio"]).agg({"Cantidad": "sum"}).reset_index()
# Sort the dataframe by the same columns we used in the groupby
df_result = df_grouped.sort_values(["Tarifa", "Precio"])
# Show the result
print(df_result)
The result:
Tarifa Precio Cantidad
2 Vecina 155 87
1 Misma Zona 130 72
0 Alejada 229 17
As you can see, the column "Cantidad" was not summed.
I hope you can help me!
Best regards!

r = (pd.concat([fac, fac2], ignore_index=True)
       .groupby('Tarifa')
       .agg({'Precio': 'first', 'Cantidad': 'sum'}))
print(r)
Precio Cantidad
Tarifa
Alejada 229 26
Grande 250 1
Misma Zona 130 112
Vecina 155 148

Since only one column is variable, you can try doing
df = pd.concat([fac, fac2], axis=0)[['Tarifa', 'Precio', 'Cantidad']]
df_result = df.groupby(['Tarifa', 'Precio']).sum().reset_index()
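
For reference, a minimal, self-contained sketch of that pattern using a subset of the posted numbers (fac and fac2 as in the question). If the sum still fails on the real frames, check that "Cantidad" is numeric and that the "Tarifa"/"Precio" values match exactly across both frames; stray whitespace or mismatched dtypes will split the groups instead of combining them.
import pandas as pd

# two small frames mirroring the question's data
fac = pd.DataFrame({'Tarifa': ['Vecina', 'Misma Zona'],
                    'Precio': [155, 130],
                    'Cantidad': [87, 72]})
fac2 = pd.DataFrame({'Tarifa': ['Vecina', 'Misma Zona'],
                     'Precio': [155, 130],
                     'Cantidad': [61, 40]})

# stack both frames, then sum Cantidad within each (Tarifa, Precio) group
df = pd.concat([fac, fac2], axis=0)[['Tarifa', 'Precio', 'Cantidad']]
print(df.groupby(['Tarifa', 'Precio']).sum().reset_index())
#        Tarifa  Precio  Cantidad
# 0  Misma Zona     130       112
# 1      Vecina     155       148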

Populate a dataframe column based on a column of other dataframe

I have a dataframe with the population of a region, and I want to populate a column of another dataframe with the same distribution.
The first dataframe looks like this:
Municipio Population Population5000
0 Lisboa 3184984 1291
1 Porto 2597191 1053
2 Braga 924351 375
3 Setúbal 880765 357
4 Aveiro 814456 330
5 Faro 569714 231
6 Leiria 560484 227
7 Coimbra 541166 219
8 Santarém 454947 184
9 Viseu 378784 154
10 Viana do Castelo 252952 103
11 Vila Real 214490 87
12 Castelo Branco 196989 80
13 Évora 174490 71
14 Guarda 167359 68
15 Beja 158702 64
16 Bragança 140385 57
17 Portalegre 120585 49
18 Total 12332794 5000
Basically, the second dataframe has 5000 rows, and I want to create a column with names corresponding to the Municipios from the first df.
My problem is that I don't know how to populate the column with the same occurrence distribution as in the first dataframe.
The final result would be something like this:
Municipio
0 Porto
1 Porto
2 Lisboa
3 Évora
4 Lisboa
5 Aveiro
...
4996 Viseu
4997 Lisboa
4998 Porto
4999 Guarda
5000 Beja
Can someone help me?
I would use a simple comprehension to build a list of size 5000, with as many copies of each town name as its Population5000 value, and optionally shuffle it if you want a random order:
import random

# exclude the last ('Total') row, then repeat each name Population5000 times
lst = [m for m, n in df.loc[:len(df)-2,
                            ['Municipio', 'Population5000']].to_numpy()
       for i in range(n)]
random.shuffle(lst)
result = pd.Series(1, index=lst, name='Municipio')
Initialized with random.seed(0), it gives:
Setúbal 1
Santarém 1
Lisboa 1
Setúbal 1
Aveiro 1
..
Santarém 1
Porto 1
Lisboa 1
Faro 1
Aveiro 1
Name: Municipio, Length: 5000, dtype: int64
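A loop-free sketch of the same expansion using numpy's repeat (assuming, as above, that the last row of the first dataframe is the 'Total' row and must be excluded):
import numpy as np
import pandas as pd

body = df.iloc[:-1]                            # drop the 'Total' row
municipios = np.repeat(body['Municipio'].to_numpy(),
                       body['Population5000'].to_numpy())
np.random.shuffle(municipios)                  # optional: randomize the order
df2 = pd.DataFrame({'Municipio': municipios})  # 5000 rows, one name each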
You could just do a simple map:
map = dict(zip(DF1['Population5000'], DF1['Municipio']))
DF2['Municipio'] = DF2['Population5000'].map(map)
or just change the Population5000 column name in the map (DF2) to whatever the column containing your population values is called.
map = dict(zip(municipios['Population5000'], municipios['Municipio']))
df['Municipio'] = municipios['Population5000'].map(map)
I tried this as suggested by Amen_90, and the column Municipio in the second dataframe only gets populated with one instance of each Municipio, when I wanted it to have the same value_counts as the column "Population5000" in my first dataframe.
df["Municipio"].value_counts()
Beja 1
Aveiro 1
Bragança 1
Vila Real 1
Porto 1
Santarém 1
Coimbra 1
Guarda 1
Leiria 1
Castelo Branco 1
Viseu 1
Total 1
Faro 1
Portalegre 1
Braga 1
Évora 1
Setúbal 1
Viana do Castelo 1
Lisboa 1
Name: Municipio, dtype: int64

Creating data frame conditionally based on 3 data frames

I have the following 3 data frames:
dfSpa = pd.read_csv(
    "sentences and translations/SpanishSentences.csv", sep=',')
print(dfSpa.head())
dfEng = pd.read_csv(
    'sentences and translations/EngTranslations.csv', sep=',')
print(dfEng.head())
dfIndex = pd.read_csv(
    'sentences and translations/SpaSentencesThatHaveEngTranslations.csv', sep=',')
print(dfIndex.head())
That output the following:
0 1 2
0 2482 spa Tengo que irme a dormir.
1 2487 spa Ahora, Muiriel tiene 20 años.
2 2493 spa Simplemente no sé qué decir...
3 2495 spa Yo estaba en las montañas.
4 2497 spa No sé si tengo tiempo.
0 1 2
0 1277 eng I have to go to sleep.
1 1282 eng Muiriel is 20 now.
2 1287 eng This is never going to end.
3 1288 eng I just don't know what to say.
4 1290 eng I was in the mountains.
0 1
0 2482 1277
1 2487 1282
2 2493 1288
3 2493 693485
4 2495 1290
Column 0 in dfIndex represents a Spanish sentence in dfSpa, and column 1 represents the English translation in dfEng that goes with it. dfSpa has more rows than the other two dfs, so some sentences do not have English translations. Also, dfIndex is longer than dfEng because there are some duplicate translations with different ids, such as with 2493 in dfIndex.head(), as shown above.
I am trying to create another data frame that simply has the Spanish sentence in one column and the corresponding English translation in the other column. How could I get this done?
(dfIndex
 .merge(dfSpa[[0, 2]], on=0)[[1, 2]]
 .rename(columns={2: "Spa"})
 .merge(dfEng, left_on=1, right_on=0)
 .rename(columns={2: "Eng"})[['Spa', 'Eng']])
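Unpacked into named steps, the same two merges look like this (a sketch assuming the integer column labels 0, 1, 2 shown above; dfEng is subset to [[0, 2]] only to avoid a column-name clash on label 1):
# attach the Spanish text: column 0 of dfIndex holds the Spanish sentence id
pairs = dfIndex.merge(dfSpa[[0, 2]], on=0)[[1, 2]].rename(columns={2: 'Spa'})
# attach the English text: column 1 of dfIndex holds the English sentence id
pairs = pairs.merge(dfEng[[0, 2]], left_on=1, right_on=0).rename(columns={2: 'Eng'})
result = pairs[['Spa', 'Eng']]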
You could try:
df_n = pd.DataFrame()
# look each id from dfIndex up in the sentence frames (by id, not position),
# assuming every id in dfIndex is present in both frames
spa_by_id = dfSpa.set_index(0)
eng_by_id = dfEng.set_index(0)
df_n['A'] = [spa_by_id.loc[x, 2] for x in dfIndex[0]]
df_n['B'] = [eng_by_id.loc[x, 2] for x in dfIndex[1]]
and then remove duplicated rows using:
df_n = df_n.drop_duplicates(subset=['A'])
It would be easier to check if you had sample dfs.

Remove duplicated rows in groupby? [duplicate]

This question already has an answer here:
pandas: drop duplicates in groupby 'date'
(1 answer)
Closed 4 years ago.
I'm trying to create a new column in the dataframe called volume. The DF already consists of other columns like market. What I want to do is to group by price and company and then get their count and add it in a new column called volume. Here's what I have:
df['volume'] = df.groupby(['price', 'company']).transform('count')
This does create a new column; however, it gives me all the rows. I don't need all the rows. For example, before the transformation I would get 4 rows, and after the transformation I still get 4 rows, but with a new column.
market company price volume
LA EK 206.0 2
LA SQ 206.0 1
LA EK 206.0 2
LA EK 36.0 3
LA EK 36.0 3
LA SQ 36.0 1
LA EK 36.0 3
I'd like to drop the duplicated rows. Is there a query that I can do with groupby that will only show the rows like so:
market company price volume
LA EK 206.0 2
LA SQ 206.0 1
LA SQ 36.0 1
LA EK 36.0 3
Simply drop_duplicates with the columns ['market', 'company', 'price']:
>>> df.drop_duplicates(['market', 'company', 'price'])
market company price volume
0 LA EK 206.0 2
1 LA SQ 206.0 1
3 LA EK 36.0 3
5 LA SQ 36.0 1
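Putting the transform from the question together with this drop_duplicates, a minimal end-to-end sketch (the frame below just reproduces the rows shown in the question):
import pandas as pd

df = pd.DataFrame({
    'market': ['LA'] * 7,
    'company': ['EK', 'SQ', 'EK', 'EK', 'EK', 'SQ', 'EK'],
    'price': [206.0, 206.0, 206.0, 36.0, 36.0, 36.0, 36.0]})

# count rows within each (price, company) group ...
df['volume'] = df.groupby(['price', 'company'])['price'].transform('count')
# ... then keep a single representative row per group
print(df.drop_duplicates(['market', 'company', 'price']))
#   market company  price  volume
# 0     LA      EK  206.0       2
# 1     LA      SQ  206.0       1
# 3     LA      EK   36.0       3
# 5     LA      SQ   36.0       1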
Your data contains duplicates, probably because you are only including a subset of the columns. You need something else in your data besides price (e.g. two different days could close at the same price, but you wouldn't aggregate the volume across the two).
Assuming that the price is unique for a given timestamp, market, and company, and that you first sort on your timestamp column if there is one (not required if there is only one price per company and market):
df = pd.DataFrame({
    'company': ['EK', 'SQ', 'EK', 'EK', 'EK', 'SQ', 'EK'],
    'date': ['2018-08-13'] * 3 + ['2018-08-14'] * 4,
    'market': ['LA'] * 7,
    'price': [206] * 3 + [36] * 4})
>>> (df.groupby(['market', 'date', 'company'])['price']
...     .agg(price='last', volume='count')
...     .reset_index())
market date company price volume
0 LA 2018-08-13 EK 206 2
1 LA 2018-08-13 SQ 206 1
2 LA 2018-08-14 EK 36 3
3 LA 2018-08-14 SQ 36 1

Parsing CSV file in pandas with commas

I need to create a pandas.DataFrame from a csv file. For that I am using the method pandas.read_csv(...). The problem with this file is that one or more columns contain commas within the values (I don't control the file format).
I have been trying to implement the solution from this question, but I get the following error:
pandas.errors.EmptyDataError: No columns to parse from file
For some reason, after implementing this solution, the csv file I tried to fix ends up blank.
Here is the code I am using:
# fix csv file
with open("/Users/username/works/test.csv", 'rb') as f,\
     open("/Users/username/works/test.csv", 'wb') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        row = line.split(',', 4)
        writer.writerow(row)

# Manipulate csv file
data = pd.read_csv(os.path.expanduser("/Users/username/works/test.csv"),
                   error_bad_lines=False)
Any ideas?
Data overview:
Id0 Id 1 Id 2 Country Company Title Email
23 123 456 AR name cargador email#email.com
24 123 456 AR name Executive assistant email#email.com
25 123 456 AR name Asistente Administrativo email#email.com
26 123 456 AR name Atención al cliente vía telefónica vía online email#email.com
39 123 456 AR name Asesor de ventas email#email.com
40 123 456 AR name inc. International company representative email#email.com
41 123 456 AR name Vendedor de campo email#email.com
42 123 456 AR name PUBLICIDAD ATENCIÓN AL CLIENTE email#email.com
43 123 456 AR name Asistente de Marketing email#email.com
44 123 456 AR name SOLDADOR email#email.com
217 123 456 AR name Se requiere vendedores Loja Quevedo Guayas) email#email.com
218 123 456 AR name Ing. Civil recién graduado Yaruquí email#email.com
219 123 456 AR name ayudantes enfermeria email#email.com
220 123 456 AR name Trip Leader for International Youth Exchange email#email.com
221 123 456 AR name COUNTRY MANAGER / DIRECTOR COMERCIAL email#email.com
250 123 456 AR name Ayudante de Pasteleria email#email.com Asesor email#email.com email#email.com
Pre-parsed CSV:
#,Id 1,Id 2,Country,Company,Title,Email,,,,
23,123,456,AR,name,cargador,email#email.com,,,,
24,123,456,AR,name,Executive assistant,email#email.com,,,,
25,123,456,AR,name,Asistente Administrativo,email#email.com,,,,
26,123,456,AR,name,Atención al cliente vía telefónica , vía online,email#email.com,,,
39,123,456,AR,name,Asesor de ventas,email#email.com,,,,
40,123,456,AR,name, inc.,International company representative,email#email.com,,,
41,123,456,AR,name,Vendedor de campo,email#email.com,,,,
42,123,456,AR,name,PUBLICIDAD, ATENCIÓN AL CLIENTE,email#email.com,,,
43,123,456,AR,name,Asistente de Marketing,email#email.com,,,,
44,123,456,AR,name,SOLDADOR,email#email.com,,,,
217,123,456,AR,name,Se requiere vendedores,, Loja , Quevedo, Guayas),email#email.com
218,123,456,AR,name,Ing. Civil recién graduado, Yaruquí,email#email.com,,,
219,123,456,AR,name,ayudantes enfermeria,email#email.com,,,,
220,123,456,AR,name,Trip Leader for International Youth Exchange,email#email.com,,,,
221,123,456,AR,name,COUNTRY MANAGER / DIRECTOR COMERCIAL,email#email.com,,,,
250,123,456,AR,name,Ayudante de Pasteleria,email#email.com, Asesor,email#email.com,email#email.com,
251,123,456,AR,name,Ejecutiva de Ventas,email#email.com,,,,
If you can assume that for the Company any commas are followed by spaces, and that all of the remaining errant commas are in the column prior to the email address, then a small parser can be written to process that.
Code:
import csv
import re

VALID_EMAIL = re.compile(r'[^#]+#[^#]+\.[^#]+')

def read_my_csv(file_handle):
    # build csv reader
    reader = csv.reader(file_handle)

    # get the header, and find the e-mail and title columns
    header = next(reader)
    email_column = header.index('Email')
    title_column = header.index('Title')

    # yield the header up to the e-mail column
    yield header[:email_column + 1]

    # for each row, go through and rebuild the columns
    for row in reader:
        # put the Company column back together
        while row[title_column].startswith(' '):
            row[title_column - 1] += ',' + row[title_column]
            del row[title_column]

        # put the Title column back together
        while not VALID_EMAIL.match(row[email_column]):
            row[email_column - 1] += ',' + row[email_column]
            del row[email_column]

        yield row[:email_column + 1]
Test Code:
with open ("test.csv", 'rU') as f:
generator = read_my_csv(f)
columns = next(generator)
df = pd.DataFrame(generator, columns=columns)
print(df)
Results:
# Id 1 Id 2 Country Company \
0 23 123 456 AR name
1 24 123 456 AR name
2 25 123 456 AR name
3 26 123 456 AR name
4 39 123 456 AR name
5 40 123 456 AR name, inc.
6 41 123 456 AR name
7 42 123 456 AR name
8 43 123 456 AR name
9 44 123 456 AR name
10 217 123 456 AR name
11 218 123 456 AR name
12 219 123 456 AR name
13 220 123 456 AR name
14 221 123 456 AR name
15 250 123 456 AR name
16 251 123 456 AR name
Title Email
0 cargador email#email.com
1 Executive assistant email#email.com
2 Asistente Administrativo email#email.com
3 Atención al cliente vía telefónica , vía online email#email.com
4 Asesor de ventas email#email.com
5 International company representative email#email.com
6 Vendedor de campo email#email.com
7 PUBLICIDAD, ATENCIÓN AL CLIENTE email#email.com
8 Asistente de Marketing email#email.com
9 SOLDADOR email#email.com
10 Se requiere vendedores,, Loja , Quevedo, Guayas) email#email.com
11 Ing. Civil recién graduado, Yaruquí email#email.com
12 ayudantes enfermeria email#email.com
13 Trip Leader for International Youth Exchange email#email.com
14 COUNTRY MANAGER / DIRECTOR COMERCIAL email#email.com
15 Ayudante de Pasteleria email#email.com
16 Ejecutiva de Ventas email#email.com
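As for why the file in the question ended up blank: opening the same path for reading ('rb') and writing ('wb') at once truncates it before the read loop ever runs. A minimal sketch of the question's split-and-rewrite idea with a separate output file (test_fixed.csv is just an illustrative name; note that a fixed maxsplit keeps everything after the fourth comma in a single quoted field, so the parser above remains the more robust fix):
import csv

# write to a *different* file: opening the same path in write mode truncates
# it before anything can be read, which is why the original ended up blank
with open("/Users/username/works/test.csv", newline='') as f, \
     open("/Users/username/works/test_fixed.csv", "w", newline='') as g:
    writer = csv.writer(g)
    for line in f:
        # split on the first 4 commas only; the rest of the line stays intact
        writer.writerow(line.rstrip('\n').split(',', 4))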

Simplifying categorical variables with python/pandas

I'm working with an airbnb dataset on Kaggle:
https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
and want to simplify the values for the language column into 2 groupings - english and non-english.
For instance:
users.language.value_counts()
en 15011
zh 101
fr 99
de 53
es 53
ko 43
ru 21
it 20
ja 19
pt 14
sv 11
no 6
da 5
nl 4
el 2
pl 2
tr 2
cs 1
fi 1
is 1
hu 1
Name: language, dtype: int64
And the result I want is:
users.language.value_counts()
english 15011
non-english 459
Name: language, dtype: int64
This is sort of the solution I want:
def language_groupings():
for i in users:
if users.language !='en':
replace(users.language.str, 'non-english')
else:
replace(users.language.str, 'english')
return users
users['language'] = users.apply(lambda row: language_groupings)
Except there's obviously something wrong with this as it returns an empty series when I run value_counts on the column.
Try this:
import numpy as np

users.language = np.where(users.language != 'en', 'non-english', 'english')
Is that what you want?
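Checking against the desired output from the question (assuming users is the frame described above):
users.language.value_counts()
english        15011
non-english      459
Name: language, dtype: int64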
In [181]: x
Out[181]:
val
en 15011
zh 101
fr 99
de 53
es 53
ko 43
ru 21
it 20
ja 19
pt 14
sv 11
no 6
da 5
nl 4
el 2
pl 2
tr 2
cs 1
fi 1
is 1
hu 1
In [182]: x.groupby(x.index == 'en').sum()
Out[182]:
val
False 459
True 15011
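Applied directly to the users frame, the same idea is a two-line sketch: build the per-language counts, then group them by whether the index equals 'en':
counts = users.language.value_counts()
print(counts.groupby(counts.index == 'en').sum())
# False      459   <- non-english
# True     15011   <- english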
