I have a bunch of CSV files with 4-line headers. In these files, I want to change the values in the sixth column based on the values in the second column. For example, if the second column, named PRODUCT, is Banana, I want to change the value in the same row under TIME to 10m. If the product is Apple, I want the time to be 15m, and so on.
When 12:07
Area Produce
Store Name FF
Eatfresh
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 5m
2 Apple 400000 F4 8m
3 Pair 6m
4 Banana 4000 G3 7m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 6m
Desired Output
When 12:07
Area Produce
Store Name FF
Eatfresh
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 10m
2 Apple 400000 F4 15m
3 Pair 6m
4 Banana 4000 G3 10m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 15m
I want all of the results to be output to a directory called NTime. Here is what I have thus far; being new to coding, I don't understand a great deal and have gotten stuck on how to make the actual changes. I found Python/pandas idiom for if/then/else and it seems similar to what I want to do, but I don't completely understand what is going on.
import pandas as pd
import glob
import os

fns = glob.glob('*.csv')
colname1 = 'PRODUCT'
colname2 = 'TIME'
for fn in fns:
    s = pd.read_csv(fn, usecols=[colname1], squeeze=True, skiprows=4, header=0)
    with open(os.path.join('NTime', fn), 'wb') as f:
        # ... this is where I'm stuck
Can someone help me?
You can do this with a combination of groupby, replace and a dict
In [76]: from pandas import DataFrame, Series
In [77]: fruits = ['banana', 'apple', 'pear', 'banana', 'watermelon', 'orange', 'apple']
In [78]: times = ['5m', '8m', '6m', '7m', '13m', '2m', '6m']
In [79]: time_map = {'banana': '10m', 'apple': '15m', 'pear': '5m'}
In [80]: df = DataFrame({'fruits': fruits, 'time': times}); df
Out[80]:
fruits time
0 banana 5m
1 apple 8m
2 pear 6m
3 banana 7m
4 watermelon 13m
5 orange 2m
6 apple 6m
In [81]: def replacer(g, time_map):
   ....:     tv = g.time.values
   ....:     return g.replace(to_replace=tv, value=time_map.get(g.name, tv))
In [82]: df.groupby('fruits').apply(replacer, time_map)
Out[82]:
fruits time
0 banana 10m
1 apple 15m
2 pear 5m
3 banana 10m
4 watermelon 13m
5 orange 2m
6 apple 15m
You said you're new to programming so I'll explain what's going on.
df.groupby('fruits') splits the DataFrame into subsets (which are DataFrames or Series objects) using the values of the fruits column.
The apply method applies a function to each of the aforementioned subsets and concatenates the result (if needed).
replacer is where the "magic" happens: each group's time values get replaced (to_replace) with the new value that's defined in time_map. The get method of dicts allows you to provide a default value if the key you're searching for (the fruit name in this case) is not there. nan is commonly used for this purpose, but here I'm actually just using the time that was already there if there isn't a new one defined for it in the time_map dict.
One thing to note is my use of g.name. This doesn't normally exist as an attribute on DataFrames (you can of course define it yourself if you want to), but is there so you can perform computations that may require the group name. In this case that's the "current" fruit you're looking at when you apply your function.
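As an aside, the same replacement can be written without groupby at all, using map and fillna; a minimal sketch on the same df and time_map:

# map each fruit to its new time; where a fruit has no entry in time_map,
# map yields NaN and fillna keeps the old time
df['time'] = df['fruits'].map(time_map).fillna(df['time'])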
If you have a new value for each fruit or you write in the old values manually you can shorten this to a one-liner:
In [130]: time_map = {'banana': '10m', 'apple': '15m', 'pear': '5m', 'orange': '10m', 'watermelon': '100m'}
In [131]: s = Series(time_map, name='time')
In [132]: s[df.fruits]
Out[132]:
fruits
banana 10m
apple 15m
pear 5m
banana 10m
watermelon 100m
orange 10m
apple 15m
Name: time, dtype: object
In [133]: s[df.fruits].reset_index()
Out[133]:
fruits time
0 banana 10m
1 apple 15m
2 pear 5m
3 banana 10m
4 watermelon 100m
5 orange 10m
6 apple 15m
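One caveat on this one-liner: with newer pandas versions, indexing a Series with labels that are not all present raises a KeyError, so if a fruit might be missing from time_map, reindex is the safer spelling:

s.reindex(df.fruits).reset_index()  # missing fruits become NaN instead of raising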
Assuming that your data is in a pandas DataFrame and looks something like this (the original values, before any changes):
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 5m
2 Apple 400000 F4 8m
3 Pair 6m
4 Banana 4000 G3 7m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 6m
Then you should be able to manipulate values in one column based on values in another column (same row) using a simple loop like this:
for numi, i in enumerate(df["PRODUCT"]):
    # .loc avoids chained assignment; this assumes the default integer index
    if i == "Banana":
        df.loc[numi, "TIME"] = "10m"
    if i == "Apple":
        df.loc[numi, "TIME"] = "15m"
The code loops through the rows of the dataframe column "PRODUCT", with the row value stored as i and the row number stored as numi. It then uses if statements to identify the levels of interest in the PRODUCT column. For rows with the levels of interest (e.g. "Banana" or "Apple"), it uses the row numbers to change the value of another column in the same row.
There are lots of ways to do this, and depending on the size of your data and the number of levels (in this case products) you want to change, this isn't necessarily the most efficient approach; see the sketch below. But since you're a beginner, it is a good basic way to start.
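As one example of a more idiomatic approach, boolean masks avoid the Python-level loop entirely; a minimal sketch assuming the same column names:

# select the rows where PRODUCT matches, then set TIME for just those rows
df.loc[df["PRODUCT"] == "Banana", "TIME"] = "10m"
df.loc[df["PRODUCT"] == "Apple", "TIME"] = "15m"

This scales better because pandas applies each assignment to all matching rows at once.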
Related
I have a dataframe that is structured like below, but with 300 different products and about 20,000 orders.
Order     Avocado  Mango  Chili
1546      500      20     0
861153    200      500    5
1657446   500      20     0
79854     200      500    1
4654      500      20     0
74654     0        500    800
I found out which combinations often occur together with this code (abbreviated here to 3 products).
size = df.groupby(['AVOCADO', 'MANGO', 'CHILI'], as_index=False).size().sort_values(by=['size'], ascending=False)
Now I want to know, per product, how often it is bought solo and how often with other products.
Something like this would be my ideal output (fictional numbers), where the percentage shows what share of the total orders with that product also had the other product:
Product   Avocado  Mango  Chili
AVOCADO   100%     20%    1%
MANGO     20%      100%   3%
CHILI     20%      30%    100%
First we replace actual quantities by 1s and 0s to indicate if the products were in the order or not:
df2 = 1*(df.set_index('Order') > 0)
Then I think the easiest approach is matrix algebra wrapped into a dataframe. Given the size of your data, it is also a good idea to go directly to numpy rather than manipulate the dataframe.
For the actual number of orders that contain (product1, product2), we can do:
df3 = pd.DataFrame(data=df2.values.T @ df2.values, columns=df2.columns, index=df2.columns)
df3 looks like this:
Avocado Mango Chili
------- --------- ------- -------
Avocado 5 5 2
Mango 5 6 3
Chili 2 3 3
e.g. there are 2 orders that contain both Avocado and Chili.
If you want percentages as in your question, we need to divide by the total number of orders containing the given product. Again I think going to numpy directly is best:
df4 = pd.DataFrame(data=((df2.values / np.sum(df2.values, axis=0)).T @ df2.values), columns=df2.columns, index=df2.columns)
df4 is:
Avocado Mango Chili
------- --------- ------- -------
Avocado 1 1 0.4
Mango 0.833333 1 0.5
Chili 0.666667 1 1
The 'main' product is in the index and its companion in the columns, so, for example, of the orders with Mango, 0.833333 also had Avocado and 0.5 had Chili.
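Putting the pieces together, here is a self-contained sketch with the sample data (the column and index names come from the question; the construction of df itself is my assumption):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Order':   [1546, 861153, 1657446, 79854, 4654, 74654],
    'Avocado': [500, 200, 500, 200, 500, 0],
    'Mango':   [20, 500, 20, 500, 20, 500],
    'Chili':   [0, 5, 0, 1, 0, 800],
})

# 1/0 indicator of whether each product appears in each order
df2 = 1 * (df.set_index('Order') > 0)

# co-occurrence counts: entry (i, j) = number of orders containing both i and j
df3 = pd.DataFrame(df2.values.T @ df2.values, columns=df2.columns, index=df2.columns)

# row-normalised shares: entry (i, j) = share of orders with i that also contain j
df4 = pd.DataFrame((df2.values / np.sum(df2.values, axis=0)).T @ df2.values,
                   columns=df2.columns, index=df2.columns)
print(df4)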
I have a list of strings looking like this:
strings = ['apple', 'pear', 'grapefruit']
and I have a data frame containing id and text values like this:
id
value
1
The grapefruit is delicious! But the pear tastes awful.
2
I am a big fan og apple products
3
The quick brown fox jumps over the lazy dog
4
An apple a day keeps the doctor away
Using pandas, I would like to create a filter that gives me only the id and value for the rows that contain one or more of the strings, together with a column showing which strings are contained in the value, like this:
id  value                                                     value contains substrings:
1   The grapefruit is delicious! But the pear tastes awful.   grapefruit, pear
2   I am a big fan og apple products                          apple
4   An apple a day keeps the doctor away                      apple
How would I write this using pandas?
Use .str.findall: '|'.join(strings) builds a regex alternation of your search strings, findall returns the list of matches for each row, and .str.join(', ') turns each list into a comma-separated string. Rows with no match end up with an empty string, which the second line filters out:
df['fruits'] = df['value'].str.findall('|'.join(strings)).str.join(', ')
df[df.fruits != '']
id value fruits
0 1 The grapefruit is delicious! But the pear tast... grapefruit, pear
1 2 I am a big fan og apple products apple
3 4 An apple a day keeps the doctor away apple
I have a CSV and I want to count how many rows match specific column values. What would be the best way to do this? For example, if this was the CSV:
fruit days characteristic1 characteristic2
0 apple 1 red sweet
1 orange 2 round sweet
2 pineapple 5 prickly sweet
3 apple 4 yellow sweet
the output I would want would be:
1 apple: red,sweet
A CSV is a file with values that are separated by commas. I would recommend turning this into a .txt file with the same format. Then establish consistent spacing throughout your file (using tabs, for example), so that when you loop through each line you know where the actual information is. Then, once you know what info is in which column, you can print those specific values.
# Use a tab in between each column
fruit        days  charac1  charac2
0 apple      1     red      sweet
1 orange     2     round    sweet
2 pineapple  5     prickly  sweet
3 apple      4     yellow   sweet
This is just to get you started.
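That said, since the rest of this thread already uses pandas, a minimal pandas sketch of the same count (the file name fruits.csv and the exact column names are assumptions) might be:

import pandas as pd

df = pd.read_csv('fruits.csv')

# boolean mask of the rows where all three columns match the values of interest
mask = (
    (df['fruit'] == 'apple')
    & (df['characteristic1'] == 'red')
    & (df['characteristic2'] == 'sweet')
)
print(mask.sum(), 'apple: red,sweet')  # prints: 1 apple: red,sweet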
I am new to pandas. I want to analyze the following case. Say a fruit market publishes the prices of its fruits daily from 18:00 to 22:00, updating the prices every half hour within that window. Suppose the market gives the prices of the fruits at 18:00 as follows:
Fruit Price
Apple 10
Banana 20
Half an hour later, at 18:30, the list has been updated as follows:
Fruit Price
Apple 10
Banana 21
Orange 30
Grapes 25
Pineapple 65
I want to check whether the prices of the fruits have changed between the recent list [18:30] and the earlier one [18:00].
Here I want to get the result as:
Fruit 18:00 18:30
Banana 20 21
To solve this I am thinking of doing the following:
1) Add a time column to the two data frames.
2) Merge the tables into one.
3) Make a pivot table with index Fruit and columns ['Time', 'Price'].
I don't know how to intersect the two data frames grouped by Time, i.e. how to get the common rows of the two DataFrames.
You don't need to pivot in this case; we can simply use merge with the suffixes argument to get the desired result:
df_update = pd.merge(df, df2, on='Fruit', how='outer', suffixes=['_1800h', '_1830h'])
Fruit Price_1800h Price_1830h
0 Apple 10.0 10.0
1 Banana 20.0 21.0
2 Orange NaN 30.0
3 Grapes NaN 25.0
4 Pineapple NaN 65.0
Edit
Why are we using the outer argument? We want to keep all the new data that is updated in df2. If we use inner, for example, we will not get the newly added fruits, as shown below. Unless that is the desired output of the OP, which is not clear in this case.
df_update = pd.merge(df, df2, on='Fruit', how='inner', suffixes=['_1800h', '_1830h'])
Fruit Price_1800h Price_1830h
0 Apple 10 10.0
1 Banana 20 21.0
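Either way, to reduce the merged frame to only the fruits whose price actually changed (the Banana row in the question), one extra filter could be added; a sketch on the outer-merged df_update:

# keep fruits present in both lists whose price differs between the two times
changed = df_update[
    df_update['Price_1800h'].notna()
    & df_update['Price_1830h'].notna()
    & (df_update['Price_1800h'] != df_update['Price_1830h'])
]
print(changed)
#     Fruit  Price_1800h  Price_1830h
# 1  Banana         20.0         21.0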
If Fruit is the index of your data frames, the following code should work. The idea is to return the rows with unequal prices:
df = pd.DataFrame(index=df1.index)  # assumes df1 and df2 are both indexed by Fruit
df['1800'] = df1['Price']
df['1830'] = df2['Price']
print(df.loc[df['1800'] != df['1830']])
You can also use datetime values in your column headings.
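For instance, a hypothetical variant of the column labels using time objects instead of strings:

import datetime as dt
df.columns = [dt.time(18, 0), dt.time(18, 30)]  # real times rather than '1800'/'1830' strings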
Here is an example of my dataframe df:
Category Y X1 X2
0 Apple 0.083050996 0.164056482 0.519875358
1 Apple 0.411044939 0.774160332 0.002869499
2 Apple 0.524315907 0.422193005 0.97720091
3 Apple 0.721124638 0.645927536 0.750210715
4 Berry 0.134488729 0.299288214 0.522933484
5 Berry 0.733162132 0.608742944 0.957595544
6 Berry 0.113051075 0.641533175 0.19799635
7 Berry 0.275379123 0.249143751 0.049082766
8 Carrot 0.588121494 0.750480977 0.615399987
9 Carrot 0.878221581 0.021366296 0.069184879
Now I want the code to run a regression for each Category (i.e. a cross-sectional regression grouped by Category: Apple, Berry, Carrot, etc.).
Then I want to add a new column df['Y_hat'] that holds the forecast value from the regression, together with the corresponding two beta and two t-statistic values (the beta and t-stat values would be the same for all rows of the same category).
The final df would have 5 additional columns: Y_hat, beta 1, beta 2, t-stat 1 and t-stat 2.
You want to do a lot of things for a "GroupBy" :)
I think it is better if you slice the DataFrame by Category, then store each category's individual result in a dictionary that you use at the end of the loop to build your DataFrame.
result = {}

# loop on every category
for category in df['Category'].unique():
    # slice
    df_slice = df[df['Category'] == category]
    # run all the stuff you want to do
    result[category] = {
        'predicted_value': ***,
        'Y_hat': ***,
        'etc': ...
    }

# build a dataframe with all your results
final_df = pd.DataFrame(result)
It will be much easier if you ever need to debug, too! Good luck! :)
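For completeness, a minimal sketch of what "the stuff you want to do" might look like, using statsmodels for the OLS fit (statsmodels is my assumption here; any regression library would do) and groupby rather than manual slicing:

import pandas as pd
import statsmodels.api as sm

pieces = []
for category, g in df.groupby('Category'):
    X = sm.add_constant(g[['X1', 'X2']])   # intercept plus the two regressors
    model = sm.OLS(g['Y'], X).fit()
    g = g.copy()
    g['Y_hat'] = model.predict(X)          # in-sample forecast per row
    # betas and t-stats repeat on every row of the same category
    g['beta 1'], g['beta 2'] = model.params['X1'], model.params['X2']
    g['t-stat 1'], g['t-stat 2'] = model.tvalues['X1'], model.tvalues['X2']
    pieces.append(g)

final_df = pd.concat(pieces)

Note that with only a handful of rows per category (Carrot has two in the sample), the fit is degenerate, so treat this purely as a structural sketch.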