I have two CSVs with two columns each.
First CSV, 24K items:
ProductId Value
XXX 5
XYZ 3
ZXX 7
KWQ 5.37
I have a second CSV laid out the same way, but with fewer products, not in the same order, and with some products not included in the first one.
Second CSV, 13K items:
ProductId Value
YYY 3
XXX 9
XYZ 0.01
I want to replace the values in the first list, wherever ProductId matches between the two lists, with the Value column of the second one.
I tried with pandas (something I'm not familiar with at all), since I'm pretty new to Python.
I could get a match with .isin, but I couldn't figure out how to take the value from the second column of the second CSV and use it to replace the value of the matching row in the first CSV.
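One way to do this, sketched with the sample rows from the question (in practice you would load the two frames with pd.read_csv instead of building them inline): turn the second CSV into a ProductId-to-Value lookup, then overwrite only the matching rows of the first frame.

```python
import pandas as pd

# Inline stand-ins for the two CSVs from the question;
# replace with pd.read_csv('first.csv') / pd.read_csv('second.csv').
df1 = pd.DataFrame({'ProductId': ['XXX', 'XYZ', 'ZXX', 'KWQ'],
                    'Value': [5, 3, 7, 5.37]})
df2 = pd.DataFrame({'ProductId': ['YYY', 'XXX', 'XYZ'],
                    'Value': [3, 9, 0.01]})

# Build a ProductId -> Value lookup from the second CSV.
lookup = df2.set_index('ProductId')['Value']

# .isin finds the matches; .map pulls in the replacement values.
mask = df1['ProductId'].isin(lookup.index)
df1.loc[mask, 'Value'] = df1.loc[mask, 'ProductId'].map(lookup)

print(df1)
#   ProductId  Value
# 0       XXX   9.00
# 1       XYZ   0.01
# 2       ZXX   7.00
# 3       KWQ   5.37
```

Products that only exist in the second CSV (like YYY here) are simply ignored, and products missing from it keep their original value.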
I use DataFrames in Python to manipulate CSV files, with expressions like df['column_name'].
But df doesn't seem to find the column in the file (it raises a KeyError), even though the column really is there, and even after I double-checked that there was no mistake in the spelling.
So if I want my program to work, and my CSV file to be read by df and Python, I have to manually rename the column I want to manipulate before anything else.
To explain the situation: the files I manipulate are not mine, they are pregenerated, and it looks like Python doesn't want to read them unless I change the column name, because everything works after changing it.
I hope you have understood, and that you'll be able to help me!
Have you checked whether 'column_name' in the file and in the code have the same capitalization? Sometimes it raises an error when one is uppercase and the other is lowercase.
OK, I figured the thing out, but I don't know how to deal with it:
From the file I want to manipulate, if I copy the column named "Time", I get:
(Time
)
I added the parentheses just to show that it jumps to the next line, so it seems to be a problem with the original file: the column name literally has an "enter" in it.
So in code, for example:
time = df['Time
']
it prevents the code from working.
I don't have any idea how to deal with it, and I don't think I can fix it by renaming the column in the file, because it is pregenerated.
Have you checked for whitespace, like tabs or line breaks?
EDIT:
Now that you know the problem is a line break, and that other columns in the data frame may have the same problem, you can clean them all like this:
before:
df = pd.DataFrame([['A', 1],
                   ['B', 2],
                   ['C', 3],
                   ['D', 4],
                   ['E', 5]], columns=['column 1 \n', ' \n column2 \n'])
output:
  column 1 \n  \n column2 \n
0           A              1
1           B              2
2           C              3
3           D              4
4           E              5
After:
# cleaning the column names
new_columns = [i.strip() for i in df.columns]
df.columns = new_columns
  column 1  column2
0        A        1
1        B        2
2        C        3
3        D        4
4        E        5
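The same cleanup can also be done in one vectorized call, since pandas exposes the .str accessor on the columns Index itself; a minimal sketch:

```python
import pandas as pd

# Reproduce the dirty column names from the example above.
df = pd.DataFrame([['A', 1], ['B', 2]],
                  columns=['column 1 \n', ' \n column2 \n'])

# .str.strip() on the Index replaces the list comprehension:
# it trims leading/trailing whitespace (including the '\n')
# from every column name at once.
df.columns = df.columns.str.strip()

print(list(df.columns))  # → ['column 1', 'column2']
```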
I need to find an element in each column's list based on part of the element: a-1 should be found by the value 'a'.
.str.contains works if column1 holds a single string, but it does not work when column1 holds a list.
Original DataFrame:
Desired result:
Here is the code I tried:
frame = pd.DataFrame({'column1' : [['a-1','b-1','c-1'], ['a-2','b-2','c-2'], ['a-3','b-3','c-3']]})
frame['column1']=frame[frame['column1'].str.contains('a')]
If the order of the elements within each list doesn't change, you can try something like
frame['column1'] = frame['column1'].str[0]
frame
Output
column1
0 a-1
1 a-2
2 a-3
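If the matching element is not always in the same position, the positional .str[0] trick above won't be enough. A sketch that scans each list for the first element starting with 'a' instead:

```python
import pandas as pd

frame = pd.DataFrame({'column1': [['a-1', 'b-1', 'c-1'],
                                  ['a-2', 'b-2', 'c-2'],
                                  ['a-3', 'b-3', 'c-3']]})

# For each row's list, keep the first element whose text starts
# with 'a'; fall back to None if no element matches.
frame['column1'] = frame['column1'].apply(
    lambda items: next((x for x in items if x.startswith('a')), None))

print(frame['column1'].tolist())  # → ['a-1', 'a-2', 'a-3']
```

This works regardless of where the 'a' element sits in each list, at the cost of a Python-level apply rather than a vectorized string operation.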
I am working with a big dataset (more than 2 million rows x 10 columns) that has a column with string values that were filled oddly. Some rows start and end with many space characters, while others don't.
What I have looks like this:
col1
0 (spaces)string(spaces)
1 (spaces)string(spaces)
2 string
3 string
4 (spaces)string(spaces)
I want to get rid of those spaces at the beginning and at the end and get something like this:
col1
0 string
1 string
2 string
3 string
4 string
Normally, for a small dataset I would use a for loop (I know it's far from optimal), but now that's not an option given the time it would take.
How can I use the power of pandas to avoid a for loop here?
Thanks!
edit: I can't get rid of all the whitespaces since the strings contain spaces.
df['col1'].apply(lambda x: x.strip())
might help
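The same thing can be done without a Python-level lambda via the vectorized .str accessor, which is usually the idiomatic choice for a frame this size; a minimal sketch:

```python
import pandas as pd

# Small stand-in for the padded column from the question.
df = pd.DataFrame({'col1': ['  string  ', '  string  ', 'string',
                            'string', '  string  ']})

# .str.strip() removes leading/trailing whitespace from every row
# while leaving any spaces inside the string untouched.
df['col1'] = df['col1'].str.strip()

print(df['col1'].tolist())  # → ['string', 'string', 'string', 'string', 'string']
```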
After some searching I seem to be coming up a bit blank. I'm also a total regex simpleton...
I have a csv file with data like this:
header1 header2
row1 "asdf (qwer) asdf"
row2 "asdf (hghg) asdf (lkjh)"
row3 "asdf (poiu) mkij (vbnc) yuwuiw (hjgk)"
I've put double quotes around the rows in header2 to make clear that each one is a single field.
I want to extract each occurrence of words between brackets (). There will be at least one occurrence per row, but I don't know ahead of time how many occurrences of bracketed words will appear in each line.
Using the wonderful https://www.regextester.com/, I think the regex I need is \(.*?\)
But I keep getting:
ValueError: pattern contains no capture groups
The code I used was:
pattern = r'\(.*?\)'
extracted = df.loc[:, 'header2'].str.extractall(pattern)
Any help appreciated.
thanks
You need to include a capture group inside the escaped parentheses. Also, when using extractall, I'd use unstack so the result matches the structure of your DataFrame:
df.header2.str.extractall(r'\((.*?)\)').unstack()
          0
match     0     1     2
0      qwer   NaN   NaN
1      hghg  lkjh   NaN
2      poiu  vbnc  hjgk
If you're concerned about performance, don't use pandas string operations:
pd.DataFrame([re.findall(r'\((.*?)\)', row) for row in df.header2])
      0     1     2
0  qwer  None  None
1  hghg  lkjh  None
2  poiu  vbnc  hjgk
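Putting the extractall answer together as a runnable end-to-end sketch, using the sample rows from the question: \( and \) match the literal brackets, while the inner (.*?) is the capture group that extractall requires.

```python
import pandas as pd

df = pd.DataFrame({'header2': ['asdf (qwer) asdf',
                               'asdf (hghg) asdf (lkjh)',
                               'asdf (poiu) mkij (vbnc) yuwuiw (hjgk)']})

# Without the inner capture group, extractall raises
# "ValueError: pattern contains no capture groups".
matches = df['header2'].str.extractall(r'\((.*?)\)')

# extractall returns one row per match, indexed by
# (original row, match number); column 0 holds the captures.
print(matches[0].tolist())
# → ['qwer', 'hghg', 'lkjh', 'poiu', 'vbnc', 'hjgk']
```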