Counting rows that have the same values in specific columns in CSV - Python

I have a CSV and I want to count how many rows match specific column values; what would be the best way to do this? For example, if this was the CSV:
fruit days characteristic1 characteristic2
0 apple 1 red sweet
1 orange 2 round sweet
2 pineapple 5 prickly sweet
3 apple 4 yellow sweet
the output I would want would be
1 apple: red,sweet

A CSV is a file with values that are separated by commas. I would recommend turning this into a .txt file and keeping this same format. Then establish consistent spacing throughout your file (using tabs, for example), so that when you loop through each line you know where the actual information is. Then, once you know what info is in what column, you print those specific values.
# Use a tab in between each column
fruit days charac1 charac2
0 apple 1 red sweet
1 orange 2 round sweet
2 pineapple 5 prickly sweet
3 apple 4 yellow sweet
This is just to get you started.
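To make the suggestion above concrete, here is a minimal sketch of that kind of loop, assuming a tab-separated fruits.txt laid out like the example (the filename and the column positions are assumptions):

import sys

# Count rows whose fruit and characteristics match specific values,
# assuming a tab-separated file with a row-index column before 'fruit'.
count = 0
with open('fruits.txt') as f:
    next(f)  # skip the header line
    for line in f:
        fields = line.rstrip('\n').split('\t')
        # fields: row index, fruit, days, characteristic1, characteristic2
        if fields[1] == 'apple' and fields[3] == 'red' and fields[4] == 'sweet':
            count += 1

print(count, 'apple: red,sweet')  # -> 1 apple: red,sweet

If you already use pandas, df.groupby(['fruit', 'characteristic1', 'characteristic2']).size() gives the same kind of count for every combination at once.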

Related

Extraction of a common column pandas

I have two data frames, and I need to extract data into Column_3 of the second dataframe, DF2.
Question 1: How should I create "Column_3" from "Column_1" and "Column_2" of the first dataframe?
DF1 =
Column_1 Column_2 Column_3
Red Apple small Red Apple small
Green fruit Large Green fruit Large
Yellow Banana Medium Yellow Banana Medium
Pink Mango Tiny Pink Mango Tiny
Question 2: I need to extract "n_col3" from n_col1 & n_col2, but it should be similar to Column_3 of data frame 1 (see the brackets for info on what should be extracted).
Note: if all the information of Column_3 is not available in Column_1 & Column_2, as in Row 1 & Row 3, only the information that is available should be extracted.
DF2 =
n_col1 n_col2 n_col3
L854 fruit Charlie Green LTD Large fruit Large(Green missing Fruit Large extracted)
Red alpha 8 small Tango G250 Apple Red Apple small(all information extracted)
Mk43 Mango Beta Tiny J448 T Mango Tiny(Pink missing, Mango Tiny is extracted)
M40 Yellow Medium Romeo Banana Yellow Banana Medium(all information extracted)
I want to extract that column so that I can do further processing and merging. Can anyone help me with this? Thank you in advance.
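No answer is recorded for this one, but as a starting point, here is a hedged sketch of one interpretation: build Column_3 by concatenating Column_1 and Column_2 (Question 1), then keep only the words from n_col1 and n_col2 that appear in DF1's Column_3 vocabulary (Question 2). The frame and column names follow the question; everything else is an assumption, and the result may not match the bracketed notes exactly.

# DF1 and DF2 are assumed to already be loaded as in the question.

# Question 1: Column_3 as the concatenation of Column_1 and Column_2.
DF1['Column_3'] = DF1['Column_1'].str.strip() + ' ' + DF1['Column_2'].str.strip()

# Question 2: keep only the words of n_col1/n_col2 that occur in DF1's Column_3.
vocab = set()
for text in DF1['Column_3'].dropna():
    vocab.update(text.split())

def extract_common(row):
    # Combine both columns and drop any word not found in the DF1 vocabulary.
    words = f"{row['n_col1']} {row['n_col2']}".split()
    return ' '.join(w for w in words if w in vocab)

DF2['n_col3'] = DF2.apply(extract_common, axis=1)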

python Pandas: VLOOKUP multiple cells on column

I'm struggling with the next task: I would like to identify, using pandas (or any other tool in Python), whether any of multiple cells (Fruit 1 through Fruit 3) in each row of Table 2 is contained in the Fruits column of Table 1, and in the end obtain the "Contains Fruits Table 2?" column.
Table 1:
Fruits
apple
orange
grape
melon

Table 2:
Name   Fruit 1   Fruit 2   Fruit 3   Contains Fruits Table 2?
Mike   apple                         Yes
Bob    peach     pear      orange    Yes
Jack   banana                        No
Rob    peach     banana              No
Rita   apple     orange    banana    Yes
Table 2 can have up to 40 fruit columns. The number of rows in Table 1 is about 300.
I hope it is understandable, and someone can help me resolve this.
I really appreciate the support in advance!
Try:
filter DataFrame to include columns that contain the word "Fruit"
Use isin to check if the values are in table1["Fruits"]
Return True if any of fruits are found
map True/False to "Yes"/"No"
table2["Contains Fruits Table 2"] = table2.filter(like="Fruit")
.isin(table1["Fruits"].tolist())
.any(axis=1)
.map({True: "Yes", False: "No"})
>>> table2
Name Fruit 1 Fruit 2 Fruit 3 Contains Fruits Table 2
0 Mike apple None None Yes
1 Bob peach pear orange Yes
2 Jack banana None None No
3 Rob peach banana None No
4 Rita apple orange banana Yes
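For reference, a minimal setup under which the snippet above runs; the DataFrame construction itself is an assumption, not part of the original question:

import pandas as pd

table1 = pd.DataFrame({"Fruits": ["apple", "orange", "grape", "melon"]})

table2 = pd.DataFrame({
    "Name": ["Mike", "Bob", "Jack", "Rob", "Rita"],
    "Fruit 1": ["apple", "peach", "banana", "peach", "apple"],
    "Fruit 2": [None, "pear", None, "banana", "orange"],
    "Fruit 3": [None, "orange", None, None, "banana"],
})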

Using pandas, how can I filter a table on all values that contain a string element from a list of strings?

I have a list of strings looking like this:
strings = ['apple', 'pear', 'grapefruit']
and I have a data frame containing id and text values like this:
id  value
1   The grapefruit is delicious! But the pear tastes awful.
2   I am a big fan og apple products
3   The quick brown fox jumps over the lazy dog
4   An apple a day keeps the doctor away
Using pandas, I would like to create a filter which gives me only the id and value for those rows which contain one or more of the strings, together with a column showing which values are contained in the string, like this:
id  value                                                      value contains substrings:
1   The grapefruit is delicious! But the pear tastes awful.    grapefruit, pear
2   I am a big fan og apple products                           apple
4   An apple a day keeps the doctor away                       apple
How would I write this using pandas?
Use .str.findall:
df['fruits'] = df['value'].str.findall('|'.join(strings)).str.join(', ')
df[df.fruits != '']
id value fruits
0 1 The grapefruit is delicious! But the pear tast... grapefruit, pear
1 2 I am a big fan og apple products apple
3 4 An apple a day keeps the doctor away apple
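For completeness, a runnable sketch around this answer; the DataFrame construction is assumed, and re.escape is added in case the search strings ever contain regex metacharacters:

import re
import pandas as pd

strings = ['apple', 'pear', 'grapefruit']
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'value': [
        'The grapefruit is delicious! But the pear tastes awful.',
        'I am a big fan og apple products',
        'The quick brown fox jumps over the lazy dog',
        'An apple a day keeps the doctor away',
    ],
})

# findall returns a list of matches per row; join turns it into a readable string.
pattern = '|'.join(re.escape(s) for s in strings)
df['fruits'] = df['value'].str.findall(pattern).str.join(', ')
print(df[df.fruits != ''])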

Iterate over a dataframe to print the index and column and value

First off, I am still new to Python and have searched but have been unable to find anywhere how to do this (from a new person's perspective)...
I have a pandas DataFrame and I need to print out the index, column name and value.
Let's say I have the following dataframe
EAT DAILY WEEKLY YEARLY
Fruit
APPLE 2 5 200
ORANGE 1 3 100
BANANA 1 4 150
PEAR 0 1 40
I need to print it out such that I would get something like the following, iterating over every row in the dataframe.
Eat Apple Daily at least 2
Eat Apple Weekly at least 5
Eat Apple Yearly at least 200
Eat Orange Daily at least 1
Eat Orange Weekly at least 3
Eat Orange Yearly at least 100
..
...
....
I have tried various combinations but am still learning so any help is appreciated.
So far I have tried
for row in test.iterrows():
    index, data = row
    print index, (data['column1'])
    print index, (data['column2'])
    print index, (data['column3'])
This gives me the index and value but not the column name; plus, I'd like it to be able to iterate regardless of how many columns or rows are used. Also, I still need to be able to insert the text, which needs to be dynamic...
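For reference, the example frame used in the answers below can be reconstructed like this (the construction itself is an assumption; it is named test here to match the question's code, while the first answer calls it df):

import pandas as pd

test = pd.DataFrame(
    {'DAILY': [2, 1, 1, 0], 'WEEKLY': [5, 3, 4, 1], 'YEARLY': [200, 100, 150, 40]},
    index=pd.Index(['APPLE', 'ORANGE', 'BANANA', 'PEAR'], name='Fruit'),
)
test.columns.name = 'EAT'  # so the stacked level is labelled EAT, as in the answers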
Series of strings
f = 'Eat {Fruit} {EAT} at least {value}'.format
df.stack().reset_index(name='value').apply(lambda x: f(**x), 1)
0 Eat APPLE DAILY at least 2
1 Eat APPLE WEEKLY at least 5
2 Eat APPLE YEARLY at least 200
3 Eat ORANGE DAILY at least 1
4 Eat ORANGE WEEKLY at least 3
5 Eat ORANGE YEARLY at least 100
6 Eat BANANA DAILY at least 1
7 Eat BANANA WEEKLY at least 4
8 Eat BANANA YEARLY at least 150
9 Eat PEAR DAILY at least 0
10 Eat PEAR WEEKLY at least 1
11 Eat PEAR YEARLY at least 40
dtype: object
print out
for idx, value in df.stack().iteritems():
    print('Eat {0[0]} {0[1]} at least {1}'.format(idx, value))
Eat APPLE DAILY at least 2
Eat APPLE WEEKLY at least 5
Eat APPLE YEARLY at least 200
Eat ORANGE DAILY at least 1
Eat ORANGE WEEKLY at least 3
Eat ORANGE YEARLY at least 100
Eat BANANA DAILY at least 1
Eat BANANA WEEKLY at least 4
Eat BANANA YEARLY at least 150
Eat PEAR DAILY at least 0
Eat PEAR WEEKLY at least 1
Eat PEAR YEARLY at least 40
You can use stack to reshape to a Series with a MultiIndex and then iterate with Series.iteritems and format:
test = test.stack()
print (test)
Fruit EAT
APPLE DAILY 2
WEEKLY 5
YEARLY 200
ORANGE DAILY 1
WEEKLY 3
YEARLY 100
BANANA DAILY 1
WEEKLY 4
YEARLY 150
PEAR DAILY 0
WEEKLY 1
YEARLY 40
dtype: int64
for index, data in test.iteritems():
    print(('Eat {} {} at least {}').format(index[0], index[1], data))
Eat APPLE DAILY at least 2
Eat APPLE WEEKLY at least 5
Eat APPLE YEARLY at least 200
Eat ORANGE DAILY at least 1
Eat ORANGE WEEKLY at least 3
Eat ORANGE YEARLY at least 100
Eat BANANA DAILY at least 1
Eat BANANA WEEKLY at least 4
Eat BANANA YEARLY at least 150
Eat PEAR DAILY at least 0
Eat PEAR WEEKLY at least 1
Eat PEAR YEARLY at least 40
But if you really need a DataFrame, add reset_index and then loop with DataFrame.iterrows:
test = test.stack().reset_index(name='VAL')
print (test)
Fruit EAT VAL
0 APPLE DAILY 2
1 APPLE WEEKLY 5
2 APPLE YEARLY 200
3 ORANGE DAILY 1
4 ORANGE WEEKLY 3
5 ORANGE YEARLY 100
6 BANANA DAILY 1
7 BANANA WEEKLY 4
8 BANANA YEARLY 150
9 PEAR DAILY 0
10 PEAR WEEKLY 1
11 PEAR YEARLY 40
for index, data in test.iterrows():
    print(('Eat {} {} at least {}').format(data['Fruit'], data['EAT'], data['VAL']))
Eat APPLE DAILY at least 2
Eat APPLE WEEKLY at least 5
Eat APPLE YEARLY at least 200
Eat ORANGE DAILY at least 1
Eat ORANGE WEEKLY at least 3
Eat ORANGE YEARLY at least 100
Eat BANANA DAILY at least 1
Eat BANANA WEEKLY at least 4
Eat BANANA YEARLY at least 150
Eat PEAR DAILY at least 0
Eat PEAR WEEKLY at least 1
Eat PEAR YEARLY at least 40
Consider also a non-loop solution using pandas.DataFrame.to_string:
sdf = df.stack().reset_index(name='VALUE')
sdf['Output'] = sdf.apply(lambda row: "EAT {} {} at least {}".\
                          format(row['Fruit'], row['EAT'], row['VALUE']), axis=1)

# PRINT TO CONSOLE
print(sdf[['Output']].to_string(header=False, index=False, justify='left'))

# WRITE TO TEXT
with open('Output.txt', 'w') as f:
    f.write(sdf[['Output']].to_string(header=False, index=False, justify='left'))
# EAT APPLE DAILY at least 2
# EAT APPLE WEEKLY at least 5
# EAT APPLE YEARLY at least 200
# EAT ORANGE DAILY at least 1
# EAT ORANGE WEEKLY at least 3
# EAT ORANGE YEARLY at least 100
# EAT BANANA DAILY at least 1
# EAT BANANA WEEKLY at least 4
# EAT BANANA YEARLY at least 150
# EAT PEAR DAILY at least 0
# EAT PEAR WEEKLY at least 1
# EAT PEAR YEARLY at least 40
You will notice a justification issue, currently a reported bug in the method. Of course, you can remedy this with general string handling (strip(), replace()) in base Python.
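For example, stripping the leading spaces that the justification issue leaves behind could look like this (a small sketch, not part of the original answer):

# Strip each rendered line before printing to work around the justification bug.
out = sdf[['Output']].to_string(header=False, index=False)
print('\n'.join(line.strip() for line in out.splitlines()))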

Changing CSV files in python

I have a bunch of CSV files with 4-line headers. In these files, I want to change the values in the sixth column based on the values in the second column. For example, if the second column, under the name PRODUCT, is Banana, I would want to change the value in the same row under TIME to 10m. If the product was Apple, I would want the time to be 15m, and so on.
When 12:07
Area Produce
Store Name FF
Eatfresh
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 5m
2 Apple 400000 F4 8m
3 Pair 6m
4 Banana 4000 G3 7m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 6m
Desired Output
When 12:07
Area Produce
Store Name FF
Eatfresh
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 10m
2 Apple 400000 F4 15m
3 Pair 6m
4 Banana 4000 G3 10m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 15m
I want all of them to be output to a directory called NTime. Here is what I have thus far, but being new to coding, I don't really understand a great deal and have gotten stuck on how to make the actual changes. I found Python/pandas idiom for if/then/else and it seems similar to what I want to do, but I don't completely understand what is going on.
import pandas as pd
import glob
import os
fns = glob.glob('*.csv')
colname1 = 'PRODUCT'
colname2 = 'TIME'
for csv in fns:
    s = pd.read_csv(csv, usecols=[colname1], squeeze=True, skiprows=4, header=0)
    with open(os.path.join('NTime', fn), 'wb') as f:
Can someone help me?
You can do this with a combination of groupby, replace and a dict
In [76]: from pandas import DataFrame, Series
In [77]: fruits = ['banana', 'apple', 'pear', 'banana', 'watermelon', 'orange', 'apple']
In [78]: times = ['5m', '8m', '6m', '7m', '13m', '2m', '6m']
In [79]: time_map = {'banana': '10m', 'apple': '15m', 'pear': '5m'}
In [80]: df = DataFrame({'fruits': fruits, 'time': times})
Out[80]:
fruits time
0 banana 5m
1 apple 8m
2 pear 6m
3 banana 7m
4 watermelon 13m
5 orange 2m
6 apple 6m
In [81]: def replacer(g, time_map):
   ....:     tv = g.time.values
   ....:     return g.replace(to_replace=tv, value=time_map.get(g.name, tv))
In [82]: df.groupby('fruits').apply(replacer, time_map)
Out[82]:
fruits time
0 banana 10m
1 apple 15m
2 pear 5m
3 banana 10m
4 watermelon 13m
5 orange 2m
6 apple 15m
You said you're new to programming so I'll explain what's going on.
df.groupby('fruits') splits the DataFrame into subsets (which are DataFrames or Series objects) using the values of the fruits column.
The apply method applies a function to each of the aforementioned subsets and concatenates the result (if needed).
replacer is where the "magic" happens: each group's time values get replaced (to_replace) with the new value that's defined in time_map. The get method of dicts allows you to provide a default value if the key you're searching for (the fruit name in this case) is not there. nan is commonly used for this purpose, but here I'm actually just using the time that was already there if there isn't a new one defined for it in the time_map dict.
One thing to note is my use of g.name. This doesn't normally exist as an attribute on DataFrames (you can of course define it yourself if you want to), but is there so you can perform computations that may require the group name. In this case that's the "current" fruit you're looking at when you apply your function.
If you have a new value for each fruit or you write in the old values manually you can shorten this to a one-liner:
In [130]: time_map = {'banana': '10m', 'apple': '15m', 'pear': '5m', 'orange': '10m', 'watermelon': '100m'}
In [131]: s = Series(time_map, name='time')
In [132]: s[df.fruits]
Out[132]:
fruits
banana 10m
apple 15m
pear 5m
banana 10m
watermelon 100m
orange 10m
apple 15m
Name: time, dtype: object
In [133]: s[df.fruits].reset_index()
Out[133]:
fruits time
0 banana 10m
1 apple 15m
2 pear 5m
3 banana 10m
4 watermelon 100m
5 orange 10m
6 apple 15m
Assuming that your data is in a Pandas DataFrame and looks something like this:
PN PRODUCT NUMBER INV ENT TIME
1 Banana 600000 5m
2 Apple 400000 F4 8m
3 Pair 6m
4 Banana 4000 G3 7m
5 Watermelon 700000 13m
6 Orange 12000 2m
7 Apple 1650000 6m
Then you should be able to do manipulate values in one column based on values in another column (same row) using simple loops like this:
for numi, i in enumerate(df["PRODUCT"]):
    if i == "Banana":
        df.loc[numi, "TIME"] = "10m"
    if i == "Apple":
        df.loc[numi, "TIME"] = "15m"
The code first loops through the rows of the dataframe column "PRODUCT", with the row value stored as i and the row-number stored as numi. It then uses if statements to identify the different levels of interest in the Product column. For those rows with the levels of interest (eg "Banana" or "Apple"), it uses the row-numbers to change the value of another column in the same row.
There are lots of ways to do this, and depending on the size of your data and the number of levels (in this case "Products") you want to change, this isn't necessarily the most efficient way to do this. But since you're a beginner, this will probably be a good basic way of doing it for you to start with.
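Putting the pieces together for the original file-processing goal, here is a hedged sketch that reads each CSV, keeps the 4-line header, remaps TIME from PRODUCT, and writes the result into NTime. It assumes the lines after the header parse as a regular CSV with PRODUCT and TIME columns; the mapping values follow the question's examples:

import glob
import os

import pandas as pd

# Product -> new time, per the question (extend as needed).
time_map = {'Banana': '10m', 'Apple': '15m'}

os.makedirs('NTime', exist_ok=True)

for path in glob.glob('*.csv'):
    # Keep the 4 header lines so they can be written back unchanged.
    with open(path) as f:
        header = [next(f) for _ in range(4)]

    df = pd.read_csv(path, skiprows=4)
    # Replace TIME where the product has a new value; keep the old TIME otherwise.
    df['TIME'] = df['PRODUCT'].map(time_map).fillna(df['TIME'])

    out_path = os.path.join('NTime', os.path.basename(path))
    with open(out_path, 'w', newline='') as f:
        f.writelines(header)
        df.to_csv(f, index=False)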
