mapping two data frames using one common column - python

I have two data frames (FYI, one was created using groupby).
I want to map data frame x, which looks like this:
Fruit #
Apple 2
Pear 5
lemon 1
into data frame y, which looks like this:
Date Fruit Cost
Mon Apple 1.00
Mon Pear 2.00
Tues lemon 1.50
Tues Apple 1.00
When mapping into y I want to create a new column called #, so the final outcome should look like this:
Date Fruit Cost #
Mon Apple 1.00 2
Mon Pear 2.00 5
Tues lemon 1.50 1
Tues Apple 1.00 2
I have tried using the below
y['#'] = np.where(y['Fruit'].map(x.set_index('Fruit')['#']))
however this raises ValueError: Length of values does not match length of index.
The length of both data arrays could vary as well depending on the underlying data. Any suggestions would be most appreciated.
Thanks
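For what it's worth, one likely fix (a sketch, since the question shows no answer): `Series.map` already returns a Series aligned to y's index, so wrapping it in `np.where` — which expects a condition plus same-length choices — is what triggers the length error. Assigning the mapped Series directly should be enough, and it tolerates the two frames having different lengths:

```python
import pandas as pd

x = pd.DataFrame({'Fruit': ['Apple', 'Pear', 'lemon'], '#': [2, 5, 1]})
y = pd.DataFrame({'Date': ['Mon', 'Mon', 'Tues', 'Tues'],
                  'Fruit': ['Apple', 'Pear', 'lemon', 'Apple'],
                  'Cost': [1.00, 2.00, 1.50, 1.00]})

# Look up each fruit in y against x's Fruit -> '#' mapping;
# index alignment handles differing lengths, unmatched fruits become NaN.
y['#'] = y['Fruit'].map(x.set_index('Fruit')['#'])
```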

Related

How to get consecutive pairs in pandas data frame and find the date difference for valid pairs

Input Data:
sn fruits Quality Date
1 Apple A 2022-09-01
2 Apple A 2022-08-15
3 Apple A 2022-07-15
4 Apple B 2022-06-01
5 Apple A 2022-05-15
6 Apple A 2022-04-15
7 Banana A 2022-08-15
8 Orange A 2022-08-15
Get the average date diff for each type of fruit, but only for consecutive records with quality A.
If there are three consecutive rows of quality A, only the first 2 make a valid pair; the third is not a valid pair because the 4th record has quality B.
So in the above data we have 2 valid pairs for Apple: the 1st pair = (1, 2) = 15 days diff and the 2nd pair = (5, 6) = 15 days diff, so the average for Apple is 15 days.
Expected output
fruits avg time diff
Apple 15 days
Banana null
Orange null
How can I do this without using any looping in pandas dataframe?
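One loop-free sketch (the pairing logic and column handling are my assumptions, since the question shows no attempt): label each consecutive run of quality-A rows within a fruit, pair rows greedily inside each run with `cumcount() // 2`, drop incomplete pairs, then average the date gaps per fruit. Note that with the literal sample dates the two Apple pairs come out as 17 and 30 days rather than the 15 stated in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "sn": range(1, 9),
    "fruits": ["Apple"] * 6 + ["Banana", "Orange"],
    "Quality": ["A", "A", "A", "B", "A", "A", "A", "A"],
    "Date": pd.to_datetime([
        "2022-09-01", "2022-08-15", "2022-07-15", "2022-06-01",
        "2022-05-15", "2022-04-15", "2022-08-15", "2022-08-15"]),
})

is_a = df["Quality"].eq("A")
# Start a new run whenever quality leaves A or the fruit changes.
run = (~is_a | df["fruits"].ne(df["fruits"].shift())).cumsum().rename("run")

a = df[is_a].copy()
a["run"] = run[is_a]
a["pair"] = a.groupby(["fruits", "run"]).cumcount() // 2  # greedy pairing within a run
# Keep only complete pairs (a lone third A row in a run is discarded).
a = a[a.groupby(["fruits", "run", "pair"])["Date"].transform("size") == 2]

gaps = a.groupby(["fruits", "run", "pair"])["Date"].agg(lambda s: s.max() - s.min())
result = gaps.groupby("fruits").mean().reindex(df["fruits"].unique())
```

Fruits with no valid pair (Banana, Orange) come out as NaT after the reindex, matching the null rows in the expected output.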

Using pandas, how can I sort a table on all values that contains a string element from a list of string elements?

I have a list of strings looking like this:
strings = ['apple', 'pear', 'grapefruit']
and I have a data frame containing id and text values like this:
id value
1 The grapefruit is delicious! But the pear tastes awful.
2 I am a big fan og apple products
3 The quick brown fox jumps over the lazy dog
4 An apple a day keeps the doctor away
Using pandas, I would like to filter down to only the id and value of the rows whose value contains one or more of the strings, together with a column showing which strings are contained, like this:
id value value contains substrings:
1 The grapefruit is delicious! But the pear tastes awful. grapefruit, pear
2 I am a big fan og apple products apple
4 An apple a day keeps the doctor away apple
How would I write this using pandas?
Use .str.findall:
df['fruits'] = df['value'].str.findall('|'.join(strings)).str.join(', ')
df[df.fruits != '']
id value fruits
0 1 The grapefruit is delicious! But the pear tast... grapefruit, pear
1 2 I am a big fan og apple products apple
3 4 An apple a day keeps the doctor away apple
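Expanding the accepted one-liner into a self-contained sketch (the `re.escape` step is my addition, a precaution in case the search terms ever contain regex metacharacters, since `findall` treats the joined string as a regex):

```python
import re
import pandas as pd

strings = ['apple', 'pear', 'grapefruit']
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'value': ["The grapefruit is delicious! But the pear tastes awful.",
              "I am a big fan og apple products",
              "The quick brown fox jumps over the lazy dog",
              "An apple a day keeps the doctor away"],
})

# Build an alternation pattern, escaping each term in case it holds regex metacharacters.
pattern = '|'.join(re.escape(s) for s in strings)
df['value contains substrings:'] = df['value'].str.findall(pattern).str.join(', ')
result = df[df['value contains substrings:'] != '']
```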

Pandas group and join

I am new to pandas. I want to analyse the following case. Let's say a fruit market publishes the prices of its fruits daily from 18:00 to 22:00, updating the price list every half hour within that window. Consider the market giving the prices of the fruits at 18:00 as follows,
Fruit Price
Apple 10
Banana 20
After half an hour at 18:30, the list has been updated as follows,
Fruit Price
Apple 10
Banana 21
Orange 30
Grapes 25
Pineapple 65
I want to check whether the prices of the fruits have changed between the recent list [18:30] and the earlier one [18:00].
Here I want to get the result as,
Fruit 18:00 18:30
Banana 20 21
To solve this I am thinking to do the following,
1) Add time column in the two data frames.
2) Merge the tables into one.
3) Make a Pivot table with Index Fruit name and Column as ['Time','Price'].
I don't know how to get intersect the two data frames of grouped by Time. How to get the common rows of the two Data Frames.
You don't need to pivot in this case; we can simply use merge with the suffixes argument to get the desired results:
df_update = pd.merge(df, df2, on='Fruit', how='outer', suffixes=['_1800h', '_1830h'])
Fruit Price_1800h Price_1830h
0 Apple 10.0 10.0
1 Banana 20.0 21.0
2 Orange NaN 30.0
3 Grapes NaN 25.0
4 Pineapple NaN 65.0
Edit
Why are we using the outer argument? We want to keep all the new data that was added in df2. If we use inner, for example, we will not get the newly added fruits, as shown below - unless that is the output the OP wants, which is not clear in this case.
df_update = pd.merge(df, df2, on='Fruit', how='inner', suffixes=['_1800h', '_1830h'])
Fruit Price_1800h Price_1830h
0 Apple 10 10.0
1 Banana 20 21.0
If Fruit is the index of your data frames, the following code should work. The idea is to return the rows where the two prices are unequal:
df['1800'] = df1['Price']
df['1830'] = df2['Price']
print(df.loc[df['1800'] != df['1830']])
You can also use datetime in your column heading.
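Putting the two answers together, a self-contained sketch (frame names taken from the question) that merges on the common fruits and keeps only the rows whose price actually changed - which yields exactly the Banana row the question asks for:

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', 'Banana'], 'Price': [10, 20]})
df2 = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Orange', 'Grapes', 'Pineapple'],
                    'Price': [10, 21, 30, 25, 65]})

# Inner merge keeps only fruits present at both times; suffixes label the snapshots.
merged = df.merge(df2, on='Fruit', suffixes=['_1800', '_1830'])
changed = merged[merged['Price_1800'] != merged['Price_1830']]
```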

Pandas: Fill in missing indexes with specific ordered values that are already in column.

I have extracted a one-column dataframe with specific values. Now this is what the dataframe looks like:
Commodity
0 Cocoa
4 Coffee
6 Maize
7 Rice
10 Sugar
12 Wheat
Now I want to fill each missing index with the value above it in the column, so it should look something like this:
Commodity
0 Cocoa
1 Cocoa
2 Cocoa
3 Cocoa
4 Coffee
5 Coffee
6 Maize
7 Rice
8 Rice
9 Rice
10 Sugar
11 Sugar
12 Wheat
I don't seem to get anything from the pandas documentation Working with Text Data. Thanks for your help!
I create a new index with pd.RangeIndex. It works like range so I need to pass it a number one greater than the max number in the current index.
df.reindex(pd.RangeIndex(df.index.max() + 1)).ffill()
Commodity
0 Cocoa
1 Cocoa
2 Cocoa
3 Cocoa
4 Coffee
5 Coffee
6 Maize
7 Rice
8 Rice
9 Rice
10 Sugar
11 Sugar
12 Wheat
First expand the index to include all numbers
s = pd.Series(['Cocoa', 'Coffee', 'Maize', 'Rice', 'Sugar', 'Wheat',], index=[0,4,6,7,10, 12], name='Commodity')
s = s.reindex(range(s.index.max() + 1))
Then do a forward fill, so each gap takes the value above it:
s.ffill()

Changing CSV files in python

I have a bunch of CSV files with 4-line headers. In these files, I want to change the values in the sixth column based on the values in the second column. For example, if the second column, named PRODUCT, is Banana, I would want to change the value in the same row under TIME to 10m. If the product was Apple, I would want the time to be 15m, and so on.
When 12:07
Area Produce
Store Name FF
Eatfresh
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 5m
2 Apple 400000 F4 8m
3 Pair 6m
4 Banana 4000 G3 7m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 6m
Desired Output
When 12:07
Area Produce
Store Name FF
Eatfresh
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 10m
2 Apple 400000 F4 15m
3 Pair 6m
4 Banana 4000 G3 10m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 15m
I want all of them to be output to a directory called NTime. Here is what I have thus far, but being new to coding, I don't really understand a great deal and have gotten stuck on how to make the actual changes. I found Python/pandas idiom for if/then/else and it seems similar to what I want to do, but I don't completely understand what is going on.
import pandas as pd
import glob
import os
fns = glob.glob('*.csv')
colname1 = 'PRODUCT'
colname2 = 'TIME'
for csv in fns:
    s = pd.read_csv(csv, usecols=[colname1], squeeze=True, skiprows=4, header=0)
    with open(os.path.join('NTime', csv), 'wb') as f:
Can someone help me?
You can do this with a combination of groupby, replace and a dict
In [76]: from pandas import DataFrame, Series
In [77]: fruits = ['banana', 'apple', 'pear', 'banana', 'watermelon', 'orange', 'apple']
In [78]: times = ['5m', '8m', '6m', '7m', '13m', '2m', '6m']
In [79]: time_map = {'banana': '10m', 'apple': '15m', 'pear': '5m'}
In [80]: df = DataFrame({'fruits': fruits, 'time': times})
Out[80]:
fruits time
0 banana 5m
1 apple 8m
2 pear 6m
3 banana 7m
4 watermelon 13m
5 orange 2m
6 apple 6m
In [81]: def replacer(g, time_map):
....: tv = g.time.values
....: return g.replace(to_replace=tv, value=time_map.get(g.name, tv))
In [82]: df.groupby('fruits').apply(replacer, time_map)
Out[82]:
fruits time
0 banana 10m
1 apple 15m
2 pear 5m
3 banana 10m
4 watermelon 13m
5 orange 2m
6 apple 15m
You said you're new to programming so I'll explain what's going on.
df.groupby('fruits') splits the DataFrame into subsets (which are DataFrames or Series objects) using the values of the fruits column.
The apply method applies a function to each of the aforementioned subsets and concatenates the result (if needed).
replacer is where the "magic" happens: each group's time values get replaced (to_replace) with the new value that's defined in time_map. The get method of dicts allows you to provide a default value if the key you're searching for (the fruit name in this case) is not there. nan is commonly used for this purpose, but here I'm actually just using the time that was already there if there isn't a new one defined for it in the time_map dict.
One thing to note is my use of g.name. This doesn't normally exist as an attribute on DataFrames (you can of course define it yourself if you want to), but is there so you can perform computations that may require the group name. In this case that's the "current" fruit you're looking at when you apply your function.
If you have a new value for each fruit or you write in the old values manually you can shorten this to a one-liner:
In [130]: time_map = {'banana': '10m', 'apple': '15m', 'pear': '5m', 'orange': '10m', 'watermelon': '100m'}
In [131]: s = Series(time_map, name='time')
In [132]: s[df.fruits]
Out[132]:
fruits
banana 10m
apple 15m
pear 5m
banana 10m
watermelon 100m
orange 10m
apple 15m
Name: time, dtype: object
In [133]: s[df.fruits].reset_index()
Out[133]:
fruits time
0 banana 10m
1 apple 15m
2 pear 5m
3 banana 10m
4 watermelon 100m
5 orange 10m
6 apple 15m
Assuming that your data is in a Pandas DataFrame and looks something like this:
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 5m
2 Apple 400000 F4 8m
3 Pair 6m
4 Banana 4000 G3 7m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 6m
Then you should be able to manipulate values in one column based on values in another column (same row) using a simple loop like this:
for numi, i in enumerate(df["PRODUCT"]):
    if i == "Banana":
        df.loc[numi, "TIME"] = "10m"  # .loc avoids chained-assignment problems
    elif i == "Apple":
        df.loc[numi, "TIME"] = "15m"
The code first loops through the rows of the dataframe column "PRODUCT", with the row value stored as i and the row-number stored as numi. It then uses if statements to identify the different levels of interest in the Product column. For those rows with the levels of interest (eg "Banana" or "Apple"), it uses the row-numbers to change the value of another column in the same row.
There are lots of ways to do this, and depending on the size of your data and the number of levels (in this case "Products") you want to change, this isn't necessarily the most efficient way to do this. But since you're a beginner, this will probably be a good basic way of doing it for you to start with.
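For completeness, a vectorized sketch of the same idea (column names from the question; the times are the ones the question specifies): map the products that have a new time, and fall back to the existing value for everything else. This avoids the row loop entirely:

```python
import pandas as pd

df = pd.DataFrame({'PRODUCT': ['Banana', 'Apple', 'Pair', 'Banana', 'Watermelon'],
                   'TIME': ['5m', '8m', '6m', '7m', '13m']})
time_map = {'Banana': '10m', 'Apple': '15m'}

# Products absent from time_map map to NaN; fillna restores their original time.
df['TIME'] = df['PRODUCT'].map(time_map).fillna(df['TIME'])
```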
